A tiny transformer can encode patterns that would require an exponentially larger RNN or a doubly exponentially larger automaton to express — and verifying even simple properties about it is provably intractable.
Here is a fact that sounds like it should settle the transformer-vs-RNN debate: RNNs can recognize strictly more languages than transformers.
With fixed precision (the kind you actually use on GPUs with float16 or int8), transformers recognize exactly the star-free languages — a strict subclass of regular languages. RNNs, by contrast, recognize all regular languages. The language (aa)* — strings of a's whose length is even — is regular but not star-free. An RNN can recognize it. A fixed-precision transformer cannot.
Case closed? Transformers are weaker than RNNs?
Not so fast. This framing treats expressivity as a binary question: "can this model class recognize language L — yes or no?" But it ignores a crucial practical dimension. When both models can recognize a language, how much machinery does each one need?
Two dimensions of model power. Click each tab to see what they measure.
Think of it this way. Roman numerals and Hindu-Arabic numerals can both express any positive integer. In terms of "expressivity," they are equivalent. But try writing the number one million in Roman numerals. You need over a thousand characters. In Hindu-Arabic, you need seven: 1,000,000. The Hindu-Arabic system is exponentially more succinct.
This is not just an aesthetic difference. Succinctness has computational consequences. It is easy to check whether two Hindu-Arabic numbers are equal — just compare digit by digit, left to right. Doing the same with Roman numerals requires first normalizing both representations, a much more expensive operation. The more succinctly you can describe something, the harder it is to analyze.
The result is striking. For the same language L that both a transformer and an RNN can recognize:
And as a direct consequence: verifying whether a transformer recognizes a trivial language (e.g., the empty set) is EXPSPACE-complete — meaning it provably requires double-exponential time in the worst case. For DFAs, the same question is solvable in polynomial time. For LTL, it is PSPACE-complete. Transformers push verification complexity to a whole new level.
The paper's central idea can be stated in one sentence:
How does attention achieve this? The key is that attention is a pattern matcher: it can search for the most recent occurrence of a specific bit pattern in the input sequence. A binary counter works by flipping the rightmost 0 to 1 and resetting everything to the right. To verify that a counter increments correctly, you need to check that each new counter value differs from the previous one in exactly the right way.
An RNN processes tokens one at a time, left to right. To verify a counter with 2n bits, the RNN must store the entire current counter value in its hidden state — which requires 2n bits of state, meaning at least 22n distinct states. The RNN's sequential processing forces it to remember the counter value.
A transformer, by contrast, sees the entire sequence at once. It doesn't need to remember the counter value — it can look back at the previous counter value using attention and compare the two values bit by bit. The comparison logic is the same regardless of how large the counter is. All that scales with n is the number of bits per counter segment, not the number of layers or parameters in the transformer.
An RNN must carry the counter value forward through time. A transformer can look back at any position.
This is the fundamental asymmetry. The transformer's attention mechanism is like having random access to a tape — you can jump to any position and read what's there. The RNN's recurrence is like having a one-way read head — you can only move forward, and anything you want to remember must be packed into a fixed-size state vector.
Random access is more powerful than sequential access for certain tasks. But in what precise sense? The paper formalizes this as succinctness: measuring how many symbols it takes to write down a model that recognizes a given language. The transformer description is polynomial in n. The automaton description is doubly exponential. The RNN and LTL descriptions are singly exponential. These gaps are tight — the paper proves matching upper bounds showing you cannot do worse.
The paper doesn't just exhibit one language with a succinctness gap. It proves that the gap is inherent — no matter how cleverly you try to compress the alternative representation, the exponential (or doubly exponential) blowup is unavoidable. The argument goes through three steps:
The critical observation: if a transformer of size polynomial in n can recognize a language L whose shortest string has length 22n, then any DFA recognizing L needs at least 22n states (because a DFA for a non-empty language always accepts some word shorter than its state count). Similarly, any LTL formula for L must have size at least 2n (because the shortest word accepted by an LTL formula of size s is at most exponential in s).
Before we dive into the technical machinery, we need to understand the landscape of formal languages that transformers and RNNs inhabit. This chapter builds the vocabulary we need for the rest of the lesson.
An alphabet Σ is a finite set of symbols (think: tokens). A word (or string) is a finite sequence of symbols from Σ, like aabbc or 01001. We write Σ* for the set of all words over Σ (including the empty word), and Σ+ for all non-empty words.
A language L ⊆ Σ* is any set of words. The language could be finite ("all strings of length exactly 3") or infinite ("all strings with an even number of a's"). Language theory studies which machines can decide membership in which languages — given a word w, does w belong to L?
The simplest interesting class of languages is the regular languages. A language is regular if and only if it can be recognized by a deterministic finite automaton (DFA) — a machine with a finite number of states that reads input one symbol at a time and transitions between states according to a fixed table.
This DFA accepts strings that are zero or more a's followed by zero or more b's. Click tokens to step through.
The size of a DFA is the number of states it has. A key property: if a DFA with n states accepts any word at all, then it accepts some word of length at most n. This is the pumping lemma in disguise — and it will be critical for the succinctness arguments later.
Now for the crucial subclass. A star-free regular expression is built from the empty set ∅, single letters {a}, and the operators: union (∪), concatenation (·), and complementation (overline). Critically, it does not use the Kleene star (*). Instead, the complement operator provides the power to express infinite patterns.
For example, the language a*b* (all a's followed by all b's) is star-free because it can be written as:
Reading this: ∅ is the complement of the empty set — meaning "all strings." So ∅ · b · a · ∅ describes "any string containing b followed by a." Complementing that gives us "all strings that do NOT contain b followed by a" — which is exactly a*b*.
Wait — where did the complement go? Let me be precise. The complement of (Σ* · b · a · Σ*) equals a*b*. We're using Σ* = ∅.
But the language (aa)* — strings of a's with even length — is not star-free. Proving this requires the machinery of semigroup theory (specifically, the syntactic monoid of (aa)* is not aperiodic). The intuition: star-free languages cannot count modulo any number. They can distinguish patterns and orderings, but not parities.
| Language class | Example | Machine model | Logic |
|---|---|---|---|
| Star-free | a*b* (a's then b's) | Counter-free DFA | FO[<] / LTL |
| Regular | (aa)* (even-length a's) | DFA / NFA | MSO[<] |
| Context-free | anbn | Pushdown automaton | — |
Fixed-precision transformers recognize exactly the star-free languages. Fixed-precision RNNs recognize exactly the regular languages. This is the expressivity gap — RNNs cover a strictly larger class. But within the star-free languages that both can handle, the question of succinctness remains wide open, and that is what this paper settles.
The paper works with precise mathematical models of transformers, temporal logic, and a programming language called B-RASP. Let's define each one carefully, because the succinctness results depend on the exact definitions.
A UHAT is a simplified but mathematically precise model of a transformer encoder. It processes a sequence of tokens and outputs a sequence of vectors of the same length. Here are the components:
Token embedding: A function emb: Σ → Qd that maps each token to a rational-valued vector. This extends to sequences: emb(a1...an) = emb(a1)...emb(an).
Attention layer (UHA): This is where the magic happens. A unique hard-attention layer has width r and consists of:
On input v1,...,vn, the layer computes:
For each position i, find the unmasked position j that maximizes the score (breaking ties with min or max). Copy that position's vector as the attention vector ai = vj. Then output C(vi, ai) — an affine function of the current position and the attended position.
ReLU layer: Applies max(0, x) to one coordinate of each vector. This gives the transformer the ability to compute conditional logic (if-then-else via ReLU).
Full UHAT: Token embedding, followed by a fixed sequence of UHA and ReLU layers. To use it as a language recognizer, we add an acceptance vector t and check whether 〈t, vk〉 > 0 at the output position k (first or last).
Watch how a UHAT processes a 4-token input. Hover over layers to see data shapes.
LTL is a logic for describing properties of sequences. A formula in LTL over alphabet Σ is built from:
We also define shortcuts: Pφ = ⊤ S φ (at some past position, φ held), Fφ = ⊤ U φ (at some future position, φ holds), Xφ = ⊥ U φ (at the next position, φ holds), Gφ = φ ∧ ¬F¬φ (globally, φ holds at all future positions).
A key theorem due to Kamp (1968): LTL is expressively equivalent to the star-free languages. So LTL and UHATs recognize exactly the same class of languages — but at potentially very different descriptional cost.
Let's see an example. The language (ab)+ (one or more repetitions of "ab") can be written in LTL as:
In words: the first letter is a, every a is followed by b, and every b (that has a successor) is followed by a.
B-RASP is a programming language equivalent to UHATs, introduced by Yang et al. (2024). It is often easier to reason about B-RASP programs than UHATs because the operations are more explicit.
A B-RASP program starts with Boolean vectors P1,...,P|Σ| where Pa(i) = 1 iff position i contains token a. It then defines new Boolean vectors using two operations:
Position-wise operation: Pt+1(i) := R(i), where R is any Boolean combination of the existing vectors at position i. This is like a feedforward layer operating independently at each position.
Attention operation:
This says: among all unmasked positions j where the score predicate S(i,j) is true, select the leftmost (◂) or rightmost (▸) one, call it j*. Set Pt+1(i) to V(i,j*) — the value predicate evaluated at positions i and j*. If no such j* exists, use the default D(i).
The size of a representation R (whether UHAT, LTL formula, DFA, RNN, or B-RASP program) is the length of its binary encoding, denoted |R|. For RNNs, the precision k is counted in unary to ensure fair comparison.
We say class C1 is exponentially more succinct than class C2 if for every sub-exponential function f (i.e., f ∈ 2o(n)), there exists an R1 ∈ C1 such that any R2 ∈ C2 recognizing the same language satisfies |R2| > f(|R1|). In other words: no matter how cleverly you try to compress the C2 representation, you cannot avoid an exponential blowup.
Doubly exponentially more succinct is the same but with functions f ∈ 22o(n).
This is the heart of the paper. Everything else — the succinctness theorems, the EXPSPACE-completeness results — flows from one remarkable construction: a small B-RASP program that can verify a binary counter counting from 0 to 2n−1.
Consider strings over the alphabet {0, 1, a, b, c, #} of the following form:
Each segment has n bits (here n=4), then a letter from {a,b,c}, then a separator #. The n-bit prefix is a binary counter that increments by 1 between consecutive segments. The letters must satisfy some constraint H — say, adjacent letters must be compatible: H = {(a,b), (b,c), (b,a), (c,b)}.
This language LH,n has an interesting property: its shortest word has length Θ(n · 2n), because you need 2n segments to count from 0 to 2n−1. But we can recognize it with a B-RASP program of size O(n). How?
The crucial challenge: how does the B-RASP program verify that each counter value is exactly the previous one plus 1? Let's work through this step by step.
First, we define Boolean vectors C1,...,Cn where Ck(i) records the k-th bit of the counter at position i (extracted from the input using position-wise operations and attention).
Now, to check that the counter at position i is the counter at position j plus 1, we use a classic binary increment rule. If the counter value at j is b1j...bnj, then j+1 has value b1j...bnj + 1. The increment rule says:
How do we express "bi = bj + 1" as a Boolean formula over individual bits? We need to find a bit position k such that:
As a single B-RASP attention operation:
B-RASP C+1(i) := ▸j [j < i, Q#(j)] /* Score: find rightmost # to the left of i */ Value predicate: OR over k=1..n: ¬Ck(i) ∧ Ck(j) /* bit k: 0 in new, 1 in old (flipped 1→0) */ ∧ Ck+1..n(i) ↔ Ck+1..n(j) /* higher bits: unchanged */ /* (lower bits: all 1 in old, all 0 in new - implied) */
Wait — I wrote that too quickly. Let me be precise about what the actual B-RASP operation looks like from the paper. Here is the real attention operation for the increment check:
Let me parse this piece by piece:
Actually, looking at the paper more carefully: Ck(i) corresponds to bit k at position i, and Ck(j) to bit k at the previously found # position. The value predicate checks that the bits at position i form the number that is the bits at position j incremented by 1. The formula checks: there exists a bit position k where bit k flips from 0 to 1, all lower bits flip from 1 to 0, and all higher bits stay the same.
Watch how attention verifies a counter increment. The current position (orange) attends back to the previous # (teal). The value predicate checks the increment rule bit by bit. Use the slider to step through the sequence.
The second condition is simpler. We need to verify that adjacent letters satisfy (aj, ai) ∈ H. This is another attention operation:
This says: find the rightmost letter to my left. Check that the pair (that letter, my letter) is in H.
The B-RASP program has O(n) Boolean vectors (the Ck's plus some auxiliary ones) and O(1) attention operations (each of constant depth but with score and value predicates that mention O(n) vectors). The total size is O(n).
But the shortest accepted word has length Θ(n · 2n) — exponential in the program size. This is the core of the succinctness gap: a polynomial-size transformer encodes a language whose shortest word is exponential.
But wait — can we push this further? From exponential to doubly exponential? Yes. That's the topic of the next chapter.
The binary counter trick from Chapter 4 gives us a language whose shortest word is singly exponential in the program size. To get the EXPSPACE-completeness result and the doubly exponential succinctness gap, the paper goes one step further: it encodes tiling problems.
Imagine you have a collection of square tiles, each with colored edges (top, right, bottom, left). You need to tile a grid such that adjacent tiles have matching edge colors. This is a tiling problem.
The specific variant used in the paper is the 2n-tiling problem:
Why EXPSPACE? Because a tiling with 2n columns can simulate a Turing machine with an exponentially long tape (2n cells). Each row of the tiling encodes one configuration of the Turing machine. The horizontal constraints ensure each row is a valid configuration, and the vertical constraints ensure each row follows from the previous one by a valid transition.
The paper encodes a tiling configuration as a string over Σ = T ∪ {0, 1, #}:
where 〈i〉 is the n-bit binary encoding of column index i. Each segment is: n bits (column counter) + tile symbol + separator #. The rows are concatenated one after another.
The B-RASP program must verify five conditions. The first three (format, boundary conditions) are straightforward. The critical ones are:
Condition 4 (horizontal matching): For adjacent tiles in the same row, right-edges must match left-edges. This is exactly the kind of constraint we checked in Chapter 4 — look back at the previous tile symbol using attention with strict future masking, and verify the constraint.
Condition 5 (vertical matching): For tiles in the same column of adjacent rows, up-edges must match down-edges. This is the clever part: how do you find the tile in the same column of the previous row?
Here is the attention operation for the vertical check:
The score predicate Q#(j) ∧ ⋀k Ck(i) ↔ Ck(j) says: "j is a # position AND the counter at j matches the counter at i." Among all such positions, we take the rightmost one — which is the same column in the most recent previous row. The value predicate V then checks that the vertical edge constraints are satisfied.
A 2D tiling encoded as a 1D string. The orange position verifies its horizontal constraint by attending to the previous tile (left arrow). It verifies its vertical constraint by attending to the matching counter in the previous row (up arrow). Click tiles to see which attention pattern fires.
The B-RASP program has size O(n) — polynomial in the input parameter n. But the tiling has 2n columns and potentially 22n rows (because the Turing machine being simulated can run for doubly exponentially many steps). So the shortest accepted string can have length 22Ω(n) — doubly exponential in the program size.
This is the key lemma:
Combined with Proposition 7 (the 2n-tiling problem is EXPSPACE-complete), this gives us:
This proves the lower bound: UHAT non-emptiness is EXPSPACE-hard.
We now have all the ingredients to state and prove the paper's main theorems. This chapter derives the succinctness gaps between UHATs, LTL, RNNs, and finite automata.
The proof constructs a family of languages {Ln}n≥1 with the following properties:
Where does the family come from? From the tiling construction. Take a Turing machine Mn that implements a 2n-bit binary counter (counting from 02n to 12n). This machine uses linearly many states but an exponentially long tape, and its computation takes at least 22n steps.
Reduce Mn to a tiling problem instance In of size polynomial in n (using the standard Turing-machine-to-tiling reduction from van Emde Boas, 1997). By Lemmas 8 and 9, construct a UHAT Tn of size polynomial in n recognizing valid tiling encodings. The shortest accepted word has length at least 22n.
Now the key argument for the lower bound on LTL. There is a classical result about LTL formulas:
Let's do the math. Suppose φn is an LTL formula of size s that recognizes Ln. Then φn accepts some word of length at most 2O(s). But the shortest word in Ln has length at least 22n. So:
The LTL formula must have exponential size. Meanwhile, the UHAT has polynomial size. The gap is exponential.
Watch how the representation sizes diverge as n grows. The y-axis is log-scale. Drag n to see the gap widen.
Yes. The paper also proves the matching upper bound:
Proposition 16: Given any LTL formula φ, one can construct in polynomial time a UHAT recognizing the same language.
The proof is by induction on the structure of φ. The key case is the "Since" operator φ1 S φ2: assuming we've already computed the truth values of φ1 and φ2 at every position, we use an attention layer with strict future masking and rightmost tie-breaking to find the most recent position where ¬φ1 ∨ φ2 holds, then check whether φ2 holds there.
This means: LTL → UHAT in polynomial time, but UHAT → LTL requires exponential time (Proposition 13). The gap is exactly one exponential.
Take the same UHAT Tn from above. It has polynomial size and its shortest accepted word has length 22n. Now use the DFA shortest-word property:
So any DFA recognizing Ln needs at least 22n states. The UHAT has polynomial size. The gap is doubly exponential.
This follows by combining Theorem 17 with Proposition 3: any RNN with d-dimensional state and k-bit precision can be simulated by a DFA with 2kd states. So the RNN description size (which includes the precision k in unary) is at least logarithmic in the DFA size — meaning the RNN size is at least 2n when the DFA size is 22n.
Since RNNs recognize strictly more languages (all regular, not just star-free), this corollary must be interpreted carefully: for the star-free languages that both can recognize, the transformer description can be exponentially smaller.
| From \ To | UHAT | LTL | RNN | DFA |
|---|---|---|---|---|
| UHAT | — | exp (Prop. 13) | exp | 2-exp |
| LTL | poly (Prop. 16) | — | poly | exp |
| RNN | — | — | — | exp (Prop. 3) |
| DFA | exp | exp | poly | — |
Entry (row R, column C) shows the worst-case blowup when translating from representation R to representation C. "exp" means exponential, "2-exp" means doubly exponential, "poly" means polynomial.
Succinctness is not just an abstract property — it has direct computational consequences. The more succinct a representation, the harder it is to analyze. This chapter derives the paper's complexity results.
The most basic question you can ask about a language recognizer: does it accept any word at all? This is the non-emptiness problem.
| Representation | Non-emptiness complexity |
|---|---|
| DFA | P (polynomial time — just check reachability of an accepting state) |
| LTL | PSPACE-complete (Sistla & Clarke, 1985) |
| UHAT | EXPSPACE-complete (this paper, Theorem 5) |
Each jump in succinctness corresponds to a jump in the complexity of the non-emptiness problem. This is not a coincidence — it is a theorem.
We already built the machinery for this in Chapter 5. The 2n-tiling problem is EXPSPACE-complete (Proposition 7). We showed how to reduce it to UHAT non-emptiness in polynomial time (Lemmas 8 and 9). Therefore, UHAT non-emptiness is EXPSPACE-hard.
Let's trace through what this means computationally. EXPSPACE-complete means that any algorithm solving the problem requires space at least 2Ω(n) in the worst case (under standard complexity assumptions). Time-wise, this means at least 22Ω(n) time — doubly exponential. For comparison:
Time to verify non-emptiness for a representation of size n. Note the log-log scale.
The matching upper bound comes from two key results:
Proposition 12: The values occurring during a UHAT computation can be represented with polynomially many bits. This is a technical but crucial observation. Even though attention layers compose multiplicatively, the score function results are not forwarded to the next layer — only the selected vectors are. This prevents exponential bit-growth.
Why does this matter? Because it enables:
Proposition 13: Any UHAT of size n can be converted in exponential time to an equivalent LTL formula of exponential size.
The conversion works as follows. Because all intermediate values use polynomially many bits (Proposition 12), we can precompute all possible affine transformation results and score comparisons during the construction of the LTL formula. The LTL formula only needs to simulate the position-wise behavior — which positions attend to which — not the numerical computations.
Once we have an exponential-size LTL formula, we can check non-emptiness in PSPACE in the size of the formula (Sistla & Clarke, 1985). PSPACE in the exponential-size formula means EXPSPACE in the original UHAT size.
The paper also examines what happens with restricted masking and tie-breaking patterns:
This is a deep insight: the full EXPSPACE-hardness of verification requires transformers that use both forward-looking and backward-looking attention. If you restrict a transformer to only look in one direction (like a causal GPT-style model), verification becomes easier — though still intractable.
The paper extends the intractability result to other verification problems:
Theorem 19: Equivalence of UHATs (do two transformers recognize the same language?) is EXPSPACE-complete.
The lower bound reduces from non-emptiness: a UHAT T recognizes the empty language iff T is equivalent to a fixed "always reject" UHAT. The upper bound converts both UHATs to LTL formulas and checks equivalence in PSPACE.
Similarly, universality (does the UHAT accept all words?) is EXPSPACE-complete.
Let's step back from the formal machinery and ask: what does this paper actually tell us about transformers in practice?
The theoretical community spent years debating whether transformers are "as expressive as" RNNs. The answer is no — RNNs can recognize (aa)* and transformers cannot. But this paper reframes the debate entirely. For the languages transformers can recognize, they do so vastly more efficiently than RNNs. A transformer of size n encodes patterns that an RNN of size n cannot touch.
Think of it as vocabulary vs. eloquence. English has more words than French (vocabulary/expressivity). But a skilled French poet can say in 14 lines what a clumsy English writer needs 14 pages for (eloquence/succinctness). Succinctness measures eloquence.
The EXPSPACE-completeness result is a negative result for AI safety and interpretability. It says that checking even the most basic properties of a transformer — "does it ever produce output X?" — is provably intractable in the worst case. No clever algorithm can avoid doubly exponential runtime.
The succinctness gap comes specifically from attention's ability to do content-addressed lookup. An RNN must carry everything it needs in its fixed-size hidden state. A transformer can store information in the input sequence itself and retrieve it later using attention. This is not just a practical convenience — it is a provable computational advantage.
Each architecture occupies a different point in the expressivity-succinctness space. More expressive isn't always better — succinctness matters too.
The paper explicitly notes that their RNN results apply to state-space models (SSMs) like Mamba (Gu & Dao, 2023). Since SSMs with fixed precision are also equivalent to finite automata (Merrill et al., 2024), they suffer the same exponential succinctness penalty relative to transformers. This is a theoretical argument for attention-based architectures even if SSMs are more efficient at inference.
The paper draws a fascinating parallel to number systems. Zipf's law of abbreviation says that frequently occurring concepts tend to have succinct descriptions. The Hindu-Arabic numeral system is exponentially more succinct than Roman numerals for expressing large numbers. This succinctness, the paper argues, "potentially enables mathematics and computer science as we see today." Similarly, the succinctness of transformers may be part of why they have been so successful in practice.
It is important to be clear about the limitations:
This chapter places the paper in context and links to related work and open questions.
| Paper | Result | Relation to this work |
|---|---|---|
| Yang et al. (2024) | UHATs = star-free languages; B-RASP ≡ UHAT | Foundation — this paper extends with succinctness |
| Jerad et al. (2025) | Bounds on UHATs transfer to softmax transformers | Makes UHAT results practically relevant |
| Sälzer et al. (2025) | Fixed-precision transformers are NEXP-hard to verify | This paper improves to EXPSPACE-complete (tighter) |
| Merrill et al. (2024) | SSMs with fixed precision = regular languages | SSMs are exponentially less succinct than transformers |
| Li & Cotterell (2025) | Characterizing softmax transformer expressivity | Open: succinctness of softmax transformers |
| Sistla & Clarke (1985) | LTL non-emptiness is PSPACE-complete | Used for the EXPSPACE upper bound |
| Symbol | Meaning |
|---|---|
| Σ | Alphabet (finite set of tokens) |
| UHAT | Unique Hard-Attention Transformer (fixed-precision) |
| B-RASP | Boolean RASP — programming language equivalent to UHAT |
| LTL | Linear Temporal Logic — logic equivalent to star-free languages |
| |R| | Size of representation R (binary encoding length) |
| 2n-tiling | Tiling a grid with 2n columns (EXPSPACE-complete) |
| Theorem | Statement |
|---|---|
| Thm 5 | UHAT/B-RASP non-emptiness is EXPSPACE-complete |
| Thm 15 | UHATs are exponentially more succinct than LTL |
| Thm 17 | UHATs are doubly exponentially more succinct than DFA |
| Cor 18 | UHATs are exponentially more succinct than RNN |
| Thm 19 | UHAT equivalence is EXPSPACE-complete |
| Prop 13 | UHAT → LTL in exponential time (improved from doubly exponential) |
| Prop 16 | LTL → UHAT in polynomial time |