DeepMind’s bid for one architecture that eats any input — images, audio, video, point clouds — and produces any output, by funnelling a huge input through a small latent bottleneck with cross-attention. No convolutions, no patches, no modality-specific front-ends.
The transformer is a beautiful, general idea: let every element attend to every other element. But two walls stop it from being the universal perception machine it looks like it should be. The Perceiver was built to knock both down.
Self-attention compares every element with every other element. For n elements that’s n² comparisons. For a sentence of 500 tokens, fine. But a single 224×224 image is 50,176 pixels — and 50,176 squared is about 2.5 billion pairwise scores, per layer, per head. Raw audio is hundreds of thousands of samples. Video is millions. You simply cannot run plain self-attention directly on the raw signal. Transformers only work on images at all because we cheat first — chopping the image into a few hundred patches.
And that cheat is the second wall. Every modality gets a hand-built front-end: convolutions or patch-embeddings for images, spectrograms or strided convs for audio, tokenizers for text, special encoders for point clouds. Each assumes a particular structure of its input. There is no single network you can point at anything. We’ve built a zoo of bespoke architectures, one per data type.
Slide the input size. A plain transformer’s self-attention cost (orange) grows with the square of the input; the Perceiver’s cross-attention (teal) grows only linearly. Watch the gap become absurd at image scale.
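The two curves can be tabulated directly. The following is a minimal sketch that counts attention scores for one layer and one head; the function names are illustrative, and the latent count of 512 is the figure this lesson uses later, assumed here as a default.

```python
def self_attention_scores(n_inputs: int) -> int:
    """Pairwise scores for full self-attention over the raw input."""
    return n_inputs * n_inputs


def cross_attention_scores(n_inputs: int, n_latents: int = 512) -> int:
    """Scores when a fixed set of latents reads the input."""
    return n_latents * n_inputs


# Illustrative input sizes: a 500-token sentence, a 224x224 image, a larger signal.
for n in (500, 224 * 224, 1_000_000):
    full = self_attention_scores(n)
    cross = cross_attention_scores(n)
    print(f"n={n:>9,}  self={full:>17,}  cross={cross:>13,}  ratio={full / cross:,.1f}x")
```

The ratio is simply the input size divided by the latent count, so for inputs shorter than the latent array the plain transformer is the cheaper of the two.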
The trick that knocks down both walls at once is almost embarrassingly simple: don’t let the huge input attend to itself. Instead, give the network a small, fixed set of latent vectors, and let those read from the input. The next chapter introduces this bottleneck; the rest of the lesson builds the whole Perceiver and Perceiver IO around it.
Here is the central object of the whole architecture: a small, fixed array of latent vectors. Think of it as the model’s workspace — a mental scratchpad. Where the input might have 50,000 elements, the latent array has just a few hundred, say 512, each a vector of some width D. And critically, the number of latents is chosen by you, completely decoupled from the input size.
The latents are learned: their initial values are trainable parameters (like a fixed set of question-vectors the network always brings to any input), and attention then refines them for the particular input at hand. The initial array is the same whether the input is an image, a sound, or a point cloud. The input pours into this small workspace; all the heavy thinking happens inside the workspace; and the answer is read back out of it.
This single design decision — a fixed-size latent workspace that’s smaller than the input — is what makes everything else possible. It is an information bottleneck (the latents must compress the whole input into 512 vectors) and a computational bottleneck (all the expensive self-attention happens on 512, not 50,000). The art is in how information gets into the latents and out of them — and that is just attention, which we build next.
The big column (orange) is the raw input — thousands of elements. The small column (teal) is the fixed latent workspace. Slide the input size: the input grows, the workspace stays exactly the same. That fixed size is the whole point.
How does the input get into the latents? With cross-attention — the same attention you know, but asymmetric. In ordinary self-attention, the queries, keys, and values all come from the same set. In cross-attention, they come from different sets: the queries come from the latents (few), and the keys and values come from the input (many).
So each of the 512 latent vectors asks a question (its query) and gathers a weighted summary from all 50,000 input elements (their keys and values). The attention matrix has shape latents × inputs — 512 by 50,000 — not inputs by inputs. The cost is the number of latents times the number of inputs: linear in the input size, because the latent count is fixed.
Take 2 latents reading from 3 input elements, in a toy setting with 2-D queries and keys and scalar values. Latent 1’s query is q = [1, 0]. The three inputs have keys k = [1,0], [0,1], [1,1] and values v = 10, 20, 30. Compute latent 1’s output (skip the √d scaling for clarity).
| input | key | score = q·k | softmax | value |
|---|---|---|---|---|
| 1 | [1,0] | 1 | 0.422 | 10 |
| 2 | [0,1] | 0 | 0.155 | 20 |
| 3 | [1,1] | 1 | 0.422 | 30 |
Softmax of [1, 0, 1] ≈ [0.422, 0.155, 0.422]: the two 1’s tie, so they receive equal weight, and the 0 gets less. Output ≈ 0.422·10 + 0.155·20 + 0.422·30 ≈ 20.0 (exactly 20 by symmetry, because the values 10 and 30 carry equal weight). Latent 1 has distilled all three inputs into a single number, weighted by how well each input’s key matched its query. Repeat for latent 2 with its own query, and the entire 3-element input has been compressed into 2 latent values. Scale that to 512 latents reading 50,000 inputs and you have the Perceiver’s encoder.
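The worked example can be checked numerically with a few lines of standard-library Python. This is a minimal sketch: `softmax` and `cross_attend` are illustrative helper names, and the query used for latent 2 is hypothetical, since the text gives only latent 1’s query.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, keys, values):
    """One latent's read: score every input key, softmax, then average the values."""
    # Dot product q·k for each input; the √d scaling is skipped, as in the text.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    output = sum(w * v for w, v in zip(weights, values))
    return scores, weights, output


keys = [[1, 0], [0, 1], [1, 1]]
values = [10, 20, 30]

scores, weights, output = cross_attend([1, 0], keys, values)
print(scores)                          # raw q·k score per input
print([round(w, 3) for w in weights])  # attention weights; tied scores get equal weight
print(round(output, 2))                # latent 1's distilled value

# Hypothetical query for latent 2 (not given in the text).
_, _, output_2 = cross_attend([0, 1], keys, values)
print(round(output_2, 2))
```

Stacking one such read per latent gives the full latents × inputs attention matrix: one row of weights per latent, one column per input element.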
Left column = latents (queries), right column = input elements (keys/values). Click a latent to see how strongly it attends to each input (line thickness = attention weight). The attention matrix is small-by-large, never large-by-large.
Once the input is distilled into 512 latents, the real computation begins — and here we go back to ordinary self-attention, but applied only to the latents. Stack a deep transformer (many self-attention + MLP layers) that operates purely on the 512-vector workspace. The latents attend to each other, refine each other, build up an understanding.
And now the magic of the bottleneck pays off. This deep stack costs latents squared per layer — 512² ≈ 260,000 — completely independent of the input size. Whether the input had 4,000 or 4,000,000 elements, the thinking costs exactly the same. You have decoupled network depth from input size. Want a 48-layer transformer? Its cost depends only on your chosen latent count, never on the data.
For an image (50,176 inputs), 512 latents, 8 processing layers, the rough operation counts are:
| stage | cost formula | order of magnitude |
|---|---|---|
| plain transformer (1 layer) | inputs² | ≈ 2,500,000,000 |
| Perceiver cross-attn (read) | latents × inputs | ≈ 26,000,000 |
| Perceiver self-attn (8 layers) | 8 × latents² | ≈ 2,000,000 |
Even with eight layers of thinking, the Perceiver does roughly 90× less work (about 28 million operations in total) than a single layer of the naive transformer. As the image grows, only the read term grows, and only linearly; the thinking cost stays flat.
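The table’s entries follow from three multiplications. A short sketch that reproduces them, counting attention scores only (projections and MLPs are ignored, which is a simplifying assumption of this example):

```python
n_inputs = 224 * 224   # 50,176 pixels
n_latents = 512
n_layers = 8

plain_transformer = n_inputs ** 2            # one layer of full self-attention
perceiver_read = n_latents * n_inputs        # one cross-attention read
perceiver_think = n_layers * n_latents ** 2  # eight latent self-attention layers
perceiver_total = perceiver_read + perceiver_think

print(f"plain transformer (1 layer): {plain_transformer:>14,}")
print(f"Perceiver read:              {perceiver_read:>14,}")
print(f"Perceiver think (8 layers):  {perceiver_think:>14,}")
print(f"Perceiver total:             {perceiver_total:>14,}")
print(f"ratio: {plain_transformer / perceiver_total:.1f}x")
```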
Total cost as you add processing depth, for two input sizes. The two Perceiver curves (teal/blue) rise gently with depth and barely separate — input size hardly matters. The transformer (orange, single layer) towers over both before you even add depth.
There’s a subtlety hiding in cross-attention. Attention is a set operation — it has no built-in sense of order. To the cross-attention, the pixels of an image are just an unordered bag; shuffle them and the output is identical. But position matters enormously: a pixel in the top-left is different from one in the bottom-right. So we must tell the model where each input element lives.
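The “unordered bag” claim is easy to test. In the sketch below, a single query attends over a small random input, then over the same input shuffled (each key stays paired with its value). The toy data and the helper name `attend` are made up for illustration.

```python
import math
import random


def attend(query, keys, values):
    """Single-query attention over scalar values (no √d scaling)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


rng = random.Random(0)
# Hypothetical toy input: 6 elements, each with a 2-D key and a scalar value.
keys = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(6)]
values = [rng.uniform(0, 1) for _ in range(6)]
query = [0.5, -0.25]

original = attend(query, keys, values)

# Shuffle the input elements, keeping each key paired with its value.
order = list(range(6))
rng.shuffle(order)
shuffled = attend(query, [keys[i] for i in order], [values[i] for i in order])

# The two results agree up to floating-point rounding.
print(abs(original - shuffled) < 1e-12)
```

Nothing in the computation depends on where an element sits in the list, which is why position has to be supplied as part of the element itself.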
The Perceiver does this by concatenating positional features onto each input element before it’s read. It uses Fourier features — a bank of sines and cosines at many frequencies, evaluated at each element’s coordinates. High frequencies distinguish nearby positions; low frequencies encode coarse location. Each pixel becomes “my color values, plus a code for exactly where I am.”
And here is the beautiful consequence: this is the only thing that changes between modalities. For audio, you attach 1-D position (time). For images, 2-D position (row, column). For video, 3-D (row, column, frame). For a point cloud, 3-D spatial coordinates. The network is identical — only the positional features you tack on differ. No convolutions, no patches, no tokenizers. That is what “modality-agnostic” really means: the architecture makes no structural assumptions, and you inject structure purely through position features.
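A minimal sketch of such a positional code follows. Coordinates are assumed to be normalised to [-1, 1], the frequencies are spaced linearly from 1 up to half of `max_freq`, and the raw coordinates are kept alongside the sines and cosines. These choices, the default values, and the function names are assumptions of this example rather than settings taken from the text.

```python
import math


def fourier_features(position, num_bands=4, max_freq=8.0):
    """Encode coordinates in [-1, 1] as raw position plus sin/cos at several frequencies.

    position: list of floats, one per axis (1 for audio, 2 for images, 3 for video).
    Returns a flat list of length len(position) * (2 * num_bands + 1).
    """
    # Frequencies spaced linearly from 1 up to max_freq / 2 (an assumed schedule).
    if num_bands == 1:
        freqs = [1.0]
    else:
        step = (max_freq / 2 - 1.0) / (num_bands - 1)
        freqs = [1.0 + i * step for i in range(num_bands)]

    features = list(position)  # keep the raw coordinates too
    for x in position:
        for f in freqs:
            features.append(math.sin(math.pi * f * x))
            features.append(math.cos(math.pi * f * x))
    return features


def with_position(content, position, **kwargs):
    """Concatenate an element's content (e.g. RGB) with its positional code."""
    return list(content) + fourier_features(position, **kwargs)


audio_code = fourier_features([0.25])             # 1-D: time
image_code = fourier_features([0.25, -0.5])       # 2-D: row, column
video_code = fourier_features([0.25, -0.5, 0.0])  # 3-D: row, column, frame
pixel = with_position([0.2, 0.4, 0.6], [0.25, -0.5])  # RGB + 2-D position

print(len(audio_code), len(image_code), len(video_code), len(pixel))
```

Only the length of `position` changes between modalities; the function, like the network that consumes its output, stays the same.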
Each curve is one sine/cosine frequency band. Drag the position: the dots show that element’s feature vector — a unique fingerprint of sines/cosines. Many frequencies give a multi-scale code that pins down location exactly. Toggle 1-D (audio) vs 2-D (image).
One pass of cross-attention gives the latents a first glance at the input. But a single glance may miss things — like skimming a page once. The Perceiver can re-attend to the input multiple times, interleaving: cross-attend (look at the input), then several self-attention layers (think), then cross-attend again (look again, now with a better idea of what to look for), and so on. This is the “iterative” in iterative attention.
Because each look uses the same kind of operation, the Perceiver often shares the weights across these repeated cross-attend/think blocks — the same parameters applied over and over. This makes the architecture behave like a recurrent network unrolled in depth: a fixed set of weights, applied iteratively, refining the latent state each round.
There is a tension to manage: the first cross-attention is usually given its own weights (the initial read is special), while later repeats share. Too many shared repeats and the latents stop improving (diminishing returns, like re-reading a page you’ve memorized); too few and they never fully absorb the input. The number of iterations is a knob you tune.
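The unrolled structure can be sketched as a schedule of steps, each tagged with the parameter group it uses. This is a structural illustration only, with no actual attention computed. The function names and group labels are hypothetical, and sharing one latent self-attention group across every round is an assumption of this sketch; the text states only that the first cross-attention usually keeps its own weights while later repeats share.

```python
def build_schedule(num_iterations, self_layers_per_block=2, share_weights=True):
    """List the (operation, weight_group) steps of an unrolled Perceiver-style stack.

    A weight group names the parameters of a whole block, not an individual layer.
    """
    schedule = []
    for block in range(num_iterations):
        if share_weights:
            # The first read keeps its own weights; later reads reuse one group.
            cross_id = "cross_first" if block == 0 else "cross_shared"
            self_id = "latent_shared"
        else:
            cross_id = f"cross_{block}"
            self_id = f"latent_{block}"
        schedule.append(("cross_attend", cross_id))
        schedule.extend(("self_attend", self_id) for _ in range(self_layers_per_block))
    return schedule


def distinct_weight_groups(schedule):
    """Number of separate parameter groups the schedule needs."""
    return len({group for _, group in schedule})


shared = build_schedule(num_iterations=4, share_weights=True)
unshared = build_schedule(num_iterations=4, share_weights=False)

for step in shared:
    print(step)
print(distinct_weight_groups(shared), distinct_weight_groups(unshared))
```

With sharing, the number of parameter groups stays constant as the iteration count grows, which is the sense in which the stack resembles an RNN unrolled in depth.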
The pipeline unrolled: each round is a cross-attend (read input) + self-attend (think). Toggle weight sharing — shared rounds reuse one block of parameters (same color), like an unrolled RNN. Drag the iteration count.
The original Perceiver had one weakness: it could only produce simple outputs. To classify, it just averaged the latents into one vector and ran a classifier. Fine for “is this a cat?”, useless for “label every pixel” or “generate this sentence.” Perceiver IO (2021) fixes the output side with the exact same idea it used for the input side: cross-attention.
To produce outputs, you supply a query array — one query vector per output element you want. Then you do one more cross-attention, but flipped: now the output queries are the queries, and the latents are the keys and values. Each output query reads from the latent workspace and produces one output element.
This is wonderfully flexible. Want a per-pixel output (optical flow, segmentation)? Make one query per pixel, each encoding that pixel’s position. Want a sentence? One query per output token position. Want a single class? One query. Want multiple tasks at once? Concatenate their query sets. The number and meaning of outputs is set entirely by the query array — the network doesn’t change. And the cost is outputs × latents — linear in the number of outputs.
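The decode step is the same attention with the roles swapped. In the sketch below, the number of outputs is determined purely by how many queries are passed in. The toy latents, the query values, and the helper names are hypothetical; a real model would compute keys and values from the latents with learned projections, which this example omits.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode(output_queries, latent_keys, latent_values):
    """Each output query reads the latents once and yields one output vector."""
    dim = len(latent_values[0])
    outputs = []
    for query in output_queries:
        scores = [sum(q * k for q, k in zip(query, key)) for key in latent_keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, latent_values))
            for d in range(dim)
        ])
    return outputs


# Hypothetical toy workspace: 3 latents with 2-D keys and 2-D values.
latent_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latent_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# One query gives one output (e.g. a class vector).
one_class = decode([[1.0, 1.0]], latent_keys, latent_values)

# Five position-like queries give five outputs (e.g. a tiny per-pixel map).
pixel_queries = [[x / 4, 1 - x / 4] for x in range(5)]
per_pixel = decode(pixel_queries, latent_keys, latent_values)

print(len(one_class), len(per_pixel))
```

The latents are untouched by either call; only the query array differs, and the work done is one score per (query, latent) pair.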
The latent workspace (center) is fixed. Slide the number of output queries: each query (right) cross-attends to the latents to produce one output element. Few queries → a class label; many → per-pixel maps or a sentence. The same latents serve any output shape.
Assemble the whole thing: encode (cross-attention reads the huge input into the latents), process (deep self-attention thinks in the latents), decode (output queries cross-attend to read the answer out). The simulator runs the full pipeline and tallies the cost against a plain transformer as you scale every dimension.
Set the input size, latent count, processing depth, and number of outputs. The diagram shows the three stages; the readout compares total operations against a plain transformer doing the same job. Push the input size up and watch the transformer’s cost explode while the Perceiver’s barely moves. Then shrink the latents too far and watch the bottleneck choke.
Notice what the readout teaches. The transformer’s cost is dominated by a single catastrophic term, input-squared. The Perceiver splits its cost into three modest, separable terms: read (input×latents), think (depth×latents²), write (outputs×latents). Each is linear in the input size, the depth, or the output count respectively; only the latent count, which you choose, enters quadratically. That separability is the architecture’s superpower: you can grow the input, the depth, or the output independently without any of them multiplying together.
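The three terms can be written down as a small cost model. This sketch counts attention scores only and assumes the comparison transformer runs full self-attention over the raw input at the same depth; the function names and the example sizes are illustrative.

```python
def perceiver_cost(n_inputs, n_latents, depth, n_outputs):
    """Attention-score counts for the three Perceiver IO stages."""
    return {
        "read": n_inputs * n_latents,     # encode: latents cross-attend to the input
        "think": depth * n_latents ** 2,  # process: self-attention among the latents
        "write": n_outputs * n_latents,   # decode: output queries cross-attend to latents
    }


def transformer_cost(n_inputs, depth):
    """Full self-attention over the raw input at every layer."""
    return depth * n_inputs ** 2


base = perceiver_cost(n_inputs=50_176, n_latents=512, depth=8, n_outputs=50_176)
bigger = perceiver_cost(n_inputs=4 * 50_176, n_latents=512, depth=8, n_outputs=50_176)

# Quadrupling the input changes only the read term.
for stage in ("read", "think", "write"):
    print(f"{stage:>5}: {base[stage]:>12,} -> {bigger[stage]:>12,}")

print(f"Perceiver total:   {sum(base.values()):>16,}")
print(f"Transformer total: {transformer_cost(50_176, depth=8):>16,}")
```

Because the terms are added rather than multiplied, each dial moves exactly one line of the breakdown.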
The Perceiver’s gift is linear scaling in input and output, with all the expensive depth confined to the latent workspace. But generality and bottlenecks have costs too — understanding them tells you when to reach for it.
Squeezing 50,000 inputs into 512 latents is a compression. If the task needs fine, distributed detail from everywhere in the input simultaneously, 512 latents may not have the capacity to hold it — information is lost at the bottleneck. Too few latents and the model underperforms; too many and you creep back toward the quadratic cost you were avoiding. The latent count is the central dial trading capacity against cost.
Because it bakes in no convolutional locality, the Perceiver must learn spatial structure from position features — so on pure vision tasks with limited data, a ConvNet or ViT with built-in locality can be more sample-efficient. The Perceiver shines when you have lots of data, multiple modalities, very large inputs, or unusual output shapes — situations where bespoke priors don’t exist or don’t fit.
The latent-bottleneck-via-cross-attention idea spread far beyond the original papers. The most famous descendant is the Perceiver Resampler in DeepMind’s Flamingo: it cross-attends a variable, large number of vision features down to a fixed small set of tokens that a language model can consume — exactly the encode step, used as a bridge between a vision encoder and an LLM. Many modern multimodal models use this “resampler” pattern to glue modalities together.
Total compute as the input grows. Transformer (orange): quadratic, unbounded. Perceiver (teal): linear — a gentle slope set by the latent count. The crossover is early; past it, the gap is everything.
The whole architecture, and how it sits in the family.
| Model | Handles large input? | Modality-agnostic? | Arbitrary output? | Key trick |
|---|---|---|---|---|
| Transformer | no (n²) | no | seq only | full self-attention |
| ViT | via patches | no (images) | class / patch | patch embed |
| Perceiver | yes (linear) | yes | simple (avg) | latent cross-attn |
| Perceiver IO | yes | yes | any (queries) | + output query decode |
| Set Transformer | yes | sets | pooled | inducing points |
| DETR | via CNN | no (images) | object set (detection) | learned object queries |
The unifying thread: a small set of learned query vectors that cross-attend to a large input is a recurring superpower — Perceiver latents, DETR object queries, Set Transformer inducing points, Flamingo’s resampler are all the same idea. Cross-attention into a bottleneck is one of the most reusable patterns in modern architecture design.
→ Attention Variants — cross-attention, MQA, FlashAttention in depth
→ Vision Transformer — the patch-based approach Perceiver sidesteps
→ Linear Attention & RWKV — another route around the n² wall
→ Vision-Language Models — where the resampler bridges vision and text
→ JEPA — another fixed-bottleneck idea, for self-supervised prediction