Transformers are Inherently Succinct

Chapter 0: The Problem

Here is a fact that sounds like it should settle the transformer-vs-RNN debate: RNNs can recognize strictly more languages than transformers.

With fixed precision (the kind you actually use on GPUs with float16 or int8), transformers recognize exactly the star-free languages — a strict subclass of regular languages. RNNs, by contrast, recognize all regular languages. The language (aa)* — strings of a's whose length is even — is regular but not star-free. An RNN can recognize it. A fixed-precision transformer cannot.

Case closed? Transformers are weaker than RNNs?

Not so fast. This framing treats expressivity as a binary question: "can this model class recognize language L — yes or no?" But it ignores a crucial practical dimension. When both models can recognize a language, how much machinery does each one need?

Expressivity vs. Succinctness

Two dimensions of model power.

Think of it this way. Roman numerals and Hindu-Arabic numerals can both express any positive integer. In terms of "expressivity," they are equivalent. But try writing the number one million in Roman numerals. With the basic symbols you need a thousand characters: the letter M repeated 1,000 times. In Hindu-Arabic, you need seven digits: 1,000,000. The Hindu-Arabic system is exponentially more succinct.

This is not just an aesthetic difference. Succinctness has computational consequences. It is easy to check whether two Hindu-Arabic numbers are equal — just compare digit by digit, left to right. Doing the same with Roman numerals requires first normalizing both representations, a much more expensive operation. The more succinctly you can describe something, the harder it is to analyze.

The question this paper asks: Forget which languages transformers can recognize. For the languages they share with other models, how compactly can transformers describe them compared to RNNs, finite automata, and temporal logic? The answer turns out to be: dramatically more compactly. And this succinctness is not free — it makes transformers provably harder to analyze than any of those alternatives.

The result is striking. For the same language L that both a transformer and an RNN can recognize:

The transformer description can be exponentially smaller than the RNN description
The transformer description can be doubly exponentially smaller than the finite automaton
The transformer description can be exponentially smaller than the equivalent LTL formula

And as a direct consequence: verifying whether a transformer recognizes a trivial language (e.g., the empty set) is EXPSPACE-complete, meaning it provably requires exponential space in the worst case, and the known algorithms run in doubly exponential time. For DFAs, the same question is solvable in polynomial time. For LTL, it is PSPACE-complete. Transformers push verification complexity to a whole new level.

Why isn't "RNNs recognize more languages than transformers" the final word on which architecture is more powerful?

Because transformers are faster at inference time
Because RNNs require more training data
Because expressivity is a binary question (can/cannot) that ignores how compactly each model can encode shared languages; succinctness measures the size of the description needed

Chapter 1: The Key Insight

The paper's central idea can be stated in one sentence:

Attention can implement a binary counter using only a constant number of layers. A transformer of size polynomial in n can count from 0 to 2^{2^n}, a doubly exponential range. Any finite automaton recognizing the same counting language needs at least 2^{2^n} states, and any LTL formula needs at least 2ⁿ symbols. The gap is inherent.

How does attention achieve this? The key is that attention is a pattern matcher: it can search for the most recent occurrence of a specific bit pattern in the input sequence. A binary counter works by flipping the rightmost 0 to 1 and resetting everything to the right. To verify that a counter increments correctly, you need to check that each new counter value differs from the previous one in exactly the right way.

An RNN processes tokens one at a time, left to right. To verify a counter with 2ⁿ bits, the RNN must store the entire current counter value in its hidden state, which requires 2ⁿ bits of state, meaning at least 2^{2^n} distinct states. The RNN's sequential processing forces it to remember the counter value.

A transformer, by contrast, sees the entire sequence at once. It doesn't need to remember the counter value: it can look back at the previous counter value using attention and compare the two values bit by bit. The comparison logic has the same shape regardless of how large the counter is. The number of layers stays constant, and the description grows only polynomially with n (the number of bits per counter segment).

Sequential Memory vs. Parallel Lookup

An RNN must carry the counter value forward through time. A transformer can look back at any position.

This is the fundamental asymmetry. The transformer's attention mechanism is like having random access to a tape — you can jump to any position and read what's there. The RNN's recurrence is like having a one-way read head — you can only move forward, and anything you want to remember must be packed into a fixed-size state vector.

Random access is more powerful than sequential access for certain tasks. But in what precise sense? The paper formalizes this as succinctness: measuring how many symbols it takes to write down a model that recognizes a given language. The transformer description is polynomial in n. The automaton description is doubly exponential. The RNN and LTL descriptions are singly exponential. These gaps are tight — the paper proves matching upper bounds showing you cannot do worse.

The proof strategy

The paper doesn't just exhibit one language with a succinctness gap. It proves that the gap is inherent — no matter how cleverly you try to compress the alternative representation, the exponential (or doubly exponential) blowup is unavoidable. The argument goes through three steps:

Step 1: Encode

Show transformers can simulate a Turing machine with an exponentially large tape using binary counters encoded via attention

↓

Step 2: Reduce

Reduce the EXPSPACE-complete 2ⁿ-tiling problem to non-emptiness of a polynomial-size transformer

↓

Step 3: Compare

Use complexity-theoretic arguments to show that LTL, RNN, and DFA representations must be exponentially or doubly exponentially larger

The critical observation: if a transformer of size polynomial in n can recognize a language L whose shortest string has length 2^{2^n}, then any DFA recognizing L needs at least 2^{2^n} states (because a DFA for a non-empty language always accepts some word shorter than its state count). Similarly, any LTL formula for L must have size at least 2ⁿ (because the shortest word accepted by an LTL formula of size s is at most exponential in s).

Why can a transformer verify a binary counter increment using constant-size machinery while an RNN needs exponentially many states?

The transformer uses attention to look back at the previous counter value and compare bit by bit; the comparison logic is constant-size. The RNN must carry the entire counter value in its hidden state, requiring exponentially many distinct states.
Transformers have more parameters than RNNs
RNNs cannot represent binary numbers

Chapter 2: Formal Languages — The Landscape

Before we dive into the technical machinery, we need to understand the landscape of formal languages that transformers and RNNs inhabit. This chapter builds the vocabulary we need for the rest of the lesson.

Alphabets, words, and languages

An alphabet Σ is a finite set of symbols (think: tokens). A word (or string) is a finite sequence of symbols from Σ, like aabbc or 01001. We write Σ* for the set of all words over Σ (including the empty word), and Σ⁺ for all non-empty words.

A language L ⊆ Σ* is any set of words. The language could be finite ("all strings of length exactly 3") or infinite ("all strings with an even number of a's"). Language theory studies which machines can decide membership in which languages — given a word w, does w belong to L?

Regular languages and finite automata

The simplest interesting class of languages is the regular languages. A language is regular if and only if it can be recognized by a deterministic finite automaton (DFA) — a machine with a finite number of states that reads input one symbol at a time and transitions between states according to a fixed table.

A DFA for a*b*

This DFA accepts strings that are zero or more a's followed by zero or more b's.

The size of a DFA is the number of states it has. A key property: if a DFA with n states accepts any word at all, then it accepts some word of length at most n. This is the pumping lemma in disguise — and it will be critical for the succinctness arguments later.
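To make the DFA model concrete, here is a minimal sketch of a three-state DFA for a*b*. The state names (q0, q1, dead) and the dictionary encoding are assumptions chosen for illustration, not notation from the paper.

```python
# Hypothetical encoding of a DFA for a*b*:
# q0 = still reading a's, q1 = reading b's, dead = saw an a after a b.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "a"): "dead",
    ("q1", "b"): "q1",
    ("dead", "a"): "dead",
    ("dead", "b"): "dead",
}
ACCEPTING = {"q0", "q1"}


def dfa_accepts(word: str, start: str = "q0") -> bool:
    """Read the word one symbol at a time and report whether the final state accepts."""
    state = start
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING


print(dfa_accepts("aabbb"))  # True
print(dfa_accepts("aba"))    # False
```

The size of this DFA is 3, and it accepts the empty word, which is comfortably shorter than its state count.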

Star-free languages

Now for the crucial subclass. A star-free regular expression is built from the empty set ∅, single letters {a}, and the operators: union (∪), concatenation (·), and complementation (usually written with an overline, and as a superscript ᶜ here). Critically, it does not use the Kleene star (*). Instead, the complement operator provides the power to express infinite patterns.

For example, the language a*b* (all a's followed by all b's) over the alphabet {a, b} is star-free because it can be written as:

(∅ᶜ · b · a · ∅ᶜ)ᶜ

Reading this from the inside out: ∅ᶜ is the complement of the empty set, meaning "all strings," so Σ* = ∅ᶜ. Then ∅ᶜ · b · a · ∅ᶜ describes "any string containing b immediately followed by a." Complementing that gives "all strings that do NOT contain b immediately followed by a," which is exactly a*b*.

In other words, the complement of (Σ* · b · a · Σ*) equals a*b*.

But the language (aa)* — strings of a's with even length — is not star-free. Proving this requires the machinery of semigroup theory (specifically, the syntactic monoid of (aa)* is not aperiodic). The intuition: star-free languages cannot count modulo any number. They can distinguish patterns and orderings, but not parities.

The Schützenberger-McNaughton-Papert theorem (1971) gives three equivalent characterizations of star-free languages: (1) definable by star-free regular expressions, (2) recognized by counter-free automata, (3) definable in first-order logic with linear order. This deep connection between algebra, automata, and logic is what makes the succinctness results possible.

The hierarchy

Language class	Example	Machine model	Logic
Star-free	a*b* (a's then b's)	Counter-free DFA	FO[<] / LTL
Regular	(aa)* (even-length a's)	DFA / NFA	MSO[<]
Context-free	aⁿbⁿ	Pushdown automaton	—

Fixed-precision transformers recognize exactly the star-free languages. Fixed-precision RNNs recognize exactly the regular languages. This is the expressivity gap — RNNs cover a strictly larger class. But within the star-free languages that both can handle, the question of succinctness remains wide open, and that is what this paper settles.

Why is the language (aa)* — even-length strings of a's — not star-free?

Because it requires too many states in a DFA
Because star-free languages cannot count modulo any number (they can distinguish patterns and orderings but not parities), and (aa)* requires distinguishing even from odd length
Because it is context-free, not regular

Chapter 3: The Players — UHATs, LTL, and B-RASP

The paper works with precise mathematical models of transformers, temporal logic, and a programming language called B-RASP. Let's define each one carefully, because the succinctness results depend on the exact definitions.

Unique Hard-Attention Transformers (UHATs)

A UHAT is a simplified but mathematically precise model of a transformer encoder. It processes a sequence of tokens and outputs a sequence of vectors of the same length. Here are the components:

Token embedding: A function emb: Σ → Q^d that maps each token to a rational-valued vector. This extends to sequences: emb(a₁...a_n) = emb(a₁)...emb(a_n).

Attention layer (UHA): This is where the magic happens. A unique hard-attention layer has width r and consists of:

Three affine transformations A, B: Q^r → Q^r and C: Q^{2r} → Q^s (the query/key/value transforms)
A mask predicate M(i,j) that controls which positions can attend to which: no masking (M = ⊤), strict future masking (j < i), or strict past masking (j > i)
A tie-breaking rule: when multiple positions have the same maximal score, pick the leftmost (min) or rightmost (max)

On input v₁,...,v_n, the layer computes:

S(v_i, v_j) = ⟨A(v_i), B(v_j)⟩ (score function — dot product of queries and keys)

For each position i, find the unmasked position j that maximizes the score (breaking ties with min or max). Copy that position's vector as the attention vector a_i = v_j. Then output C(v_i, a_i) — an affine function of the current position and the attended position.
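The layer computation above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's construction: A, B, and C are passed as plain callables standing in for the affine maps, vectors are lists of integers (any exact rational type would work), and the zero vector used when no position is unmasked is an assumption, since the text does not say what happens in that case.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def uha_layer(vectors, A, B, C, mask="future", tie="max"):
    """One unique hard-attention layer (illustrative sketch).

    mask: 'none', 'future' (position i sees only j < i) or 'past' (only j > i).
    tie:  'min' picks the leftmost maximiser, 'max' the rightmost.
    """
    n = len(vectors)
    out = []
    for i in range(n):
        if mask == "future":
            candidates = list(range(0, i))
        elif mask == "past":
            candidates = list(range(i + 1, n))
        else:
            candidates = list(range(n))
        if not candidates:
            # Assumption: attend to a zero vector when every position is masked.
            attended = [0] * len(vectors[i])
        else:
            scores = {j: dot(A(vectors[i]), B(vectors[j])) for j in candidates}
            best = max(scores.values())
            winners = [j for j in candidates if scores[j] == best]
            j_star = winners[0] if tie == "min" else winners[-1]
            attended = vectors[j_star]
        # C sees the current vector concatenated with the attended vector.
        out.append(C(list(vectors[i]) + list(attended)))
    return out


# Each vector is [bit, position]; the score is the bit stored at position j.
vectors = [[1, 0], [0, 1], [1, 2], [0, 3]]
A = lambda v: [1, 0]      # constant query
B = lambda v: v           # key is the vector itself
C = lambda vv: [vv[3]]    # output the attended position
print(uha_layer(vectors, A, B, C, mask="future", tie="max"))  # [[0], [0], [0], [2]]
```

At the last position both earlier 1-bits tie for the maximal score, and the rightmost tie-break selects position 2.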

Why "unique hard attention"? "Hard" means: instead of softmax over all positions, we pick the single highest-scoring position. "Unique" means: the tie-breaking rule makes the choice deterministic. This is a simplification of real softmax attention, but recent work (Jerad et al., 2025) shows that expressivity bounds on UHATs transfer to realistic fixed-precision softmax transformers. So results about UHATs apply to practical transformers too.

ReLU layer: Applies max(0, x) to one coordinate of each vector. This gives the transformer the ability to compute conditional logic (if-then-else via ReLU).

Full UHAT: Token embedding, followed by a fixed sequence of UHA and ReLU layers. To use it as a language recognizer, we add an acceptance vector t and check whether ⟨t, v_k⟩ > 0 at the output position k (first or last).

UHAT Data Flow

A UHAT processing a 4-token input.

Linear Temporal Logic (LTL)

LTL is a logic for describing properties of sequences. A formula in LTL over alphabet Σ is built from:

Atoms: Q_a means "the current position is the letter a"
Boolean connectives: ∧, ∨, ¬
Temporal operators:
- φ S ψ ("Since"): there is some past position where ψ held, and φ has held at every position in between
- φ U ψ ("Until"): there is some future position where ψ will hold, and φ holds at every position in between

We also define shortcuts: Pφ = ⊤ S φ (at some past position, φ held), Fφ = ⊤ U φ (at some future position, φ holds), Xφ = ⊥ U φ (at the next position, φ holds), Gφ = φ ∧ ¬F¬φ (globally, φ holds at the current position and at all future positions).

A key theorem due to Kamp (1968): LTL is expressively equivalent to first-order logic with linear order, FO[<], and therefore defines exactly the star-free languages. So LTL and UHATs recognize exactly the same class of languages, but at potentially very different descriptional cost.

Let's see an example. The language (ab)⁺ (one or more repetitions of "ab") can be written in LTL as:

Q_a ∧ G(Q_a → XQ_b) ∧ G((Q_b ∧ X⊤) → XQ_a)

In words: the first letter is a, every a is followed by b, and every b (that has a successor) is followed by a.
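To see the semantics in action, here is a small evaluator for LTL over finite words with strict Since and Until, as defined above. The nested-tuple encoding of formulas and the helper names are assumptions made for this sketch. It evaluates the (ab)⁺ formula at the first position.

```python
TRUE, FALSE = ("true",), ("false",)


def holds(f, w, i):
    """Does formula f hold at position i of word w? (Strict S and U.)"""
    op = f[0]
    if op == "true":
        return True
    if op == "false":
        return False
    if op == "atom":
        return w[i] == f[1]
    if op == "not":
        return not holds(f[1], w, i)
    if op == "and":
        return holds(f[1], w, i) and holds(f[2], w, i)
    if op == "or":
        return holds(f[1], w, i) or holds(f[2], w, i)
    if op == "until":  # some j > i satisfies f[2], and f[1] holds strictly in between
        return any(holds(f[2], w, j) and all(holds(f[1], w, k) for k in range(i + 1, j))
                   for j in range(i + 1, len(w)))
    if op == "since":  # some j < i satisfies f[2], and f[1] holds strictly in between
        return any(holds(f[2], w, j) and all(holds(f[1], w, k) for k in range(j + 1, i))
                   for j in range(i))
    raise ValueError(f"unknown operator: {op}")


def X(f): return ("until", FALSE, f)
def F(f): return ("until", TRUE, f)
def G(f): return ("and", f, ("not", F(("not", f))))
def implies(f, g): return ("or", ("not", f), g)


Qa, Qb = ("atom", "a"), ("atom", "b")
AB_PLUS = ("and", Qa,
           ("and", G(implies(Qa, X(Qb))),
                   G(implies(("and", Qb, X(TRUE)), X(Qa)))))


def accepts(w):
    """Evaluate the (ab)+ formula at the first position of a non-empty word."""
    return len(w) > 0 and holds(AB_PLUS, w, 0)


print(accepts("abab"))  # True
print(accepts("aba"))   # False
```

The evaluator is deliberately naive (it recomputes subformulas), which is fine for short words but not meant as an efficient model checker.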

Boolean RASP (B-RASP)

B-RASP is a programming language equivalent to UHATs, introduced by Yang et al. (2024). It is often easier to reason about B-RASP programs than UHATs because the operations are more explicit.

A B-RASP program starts with Boolean vectors P₁,...,P_|Σ| where P_a(i) = 1 iff position i contains token a. It then defines new Boolean vectors using two operations:

Position-wise operation: P_{t+1}(i) := R(i), where R is any Boolean combination of the existing vectors at position i. This is like a feedforward layer operating independently at each position.

Attention operation:

P_{t+1}(i) := ◂_j [M(i,j), S(i,j)] V(i,j) : D(i)

This says: among all unmasked positions j where the score predicate S(i,j) is true, select the leftmost (◂) or rightmost (▸) one, call it j*. Set P_{t+1}(i) to V(i,j*), the value predicate evaluated at positions i and j*. If no such j* exists, use the default D(i).
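The attention operation reads almost directly as code. The sketch below is one possible rendering, with the mask, score, value, and default predicates passed in as Python callables; the function name and argument layout are assumptions for illustration.

```python
def attention_op(n, mask, score, value, default, direction="rightmost"):
    """Compute the new Boolean vector [P(0), ..., P(n-1)].

    mask, score, value: callables (i, j) -> bool
    default:            callable i -> bool, used when no position j qualifies
    direction:          'leftmost' or 'rightmost' choice among qualifying positions
    """
    result = []
    for i in range(n):
        hits = [j for j in range(n) if mask(i, j) and score(i, j)]
        if not hits:
            result.append(default(i))
        else:
            j_star = hits[-1] if direction == "rightmost" else hits[0]
            result.append(value(i, j_star))
    return result


# Example: "is the most recent earlier a-or-b letter an a?"
word = "abcab"
vector = attention_op(
    len(word),
    mask=lambda i, j: j < i,               # strict future masking
    score=lambda i, j: word[j] in "ab",    # candidate positions
    value=lambda i, j: word[j] == "a",     # what we read off at j*
    default=lambda i: False,
)
print(vector)  # [False, True, False, False, True]
```

Note that the score here is a Boolean predicate rather than a numeric dot product, which is what makes B-RASP programs easier to read than raw UHATs.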

B-RASP ≡ UHAT. Yang et al. (2024) proved that B-RASP programs and UHATs are expressively equivalent — they recognize exactly the same languages. Moreover, the translation between them preserves size up to a polynomial factor. So proving things about B-RASP programs immediately gives us results about transformers.

Size and succinctness

The size of a representation R (whether UHAT, LTL formula, DFA, RNN, or B-RASP program) is the length of its binary encoding, denoted |R|. For RNNs, the precision k is counted in unary to ensure fair comparison.

We say class C₁ is exponentially more succinct than class C₂ if for every sub-exponential function f (i.e., f ∈ 2^{o(n)}), there exists an R₁ ∈ C₁ such that any R₂ ∈ C₂ recognizing the same language satisfies |R₂| > f(|R₁|). In other words: no matter how cleverly you try to compress the C₂ representation, you cannot avoid an exponential blowup.

Doubly exponentially more succinct is the same but with functions f ∈ 2^{2^{o(n)}}.

In B-RASP, what does the attention operation P_t+1(i) := ▸_j[j < i, S(i,j)] V(i,j) : D(i) do?

It computes the average of all positions j where S(i,j) is true
Among all positions j to the left of i (j < i) where the score predicate S(i,j) is true, it selects the rightmost one (j*) and returns V(i,j*), or D(i) if no such j exists
It scans right to left and returns the first position where V is true

Chapter 4: The Binary Counter Trick

This is the heart of the paper. Everything else — the succinctness theorems, the EXPSPACE-completeness results — flows from one remarkable construction: a small B-RASP program that can verify a binary counter counting from 0 to 2ⁿ−1.

The language

Consider strings over the alphabet {0, 1, a, b, c, #} of the following form:

0000a₁#0001a₂#0010a₃#...#1111a₁₆#

Each segment has n bits (here n=4), then a letter from {a,b,c}, then a separator #. The n-bit prefix is a binary counter that increments by 1 between consecutive segments. The letters must satisfy some constraint H — say, adjacent letters must be compatible: H = {(a,b), (b,c), (b,a), (c,b)}.

This language L_H,n has an interesting property: its shortest word has length Θ(n · 2ⁿ), because you need 2ⁿ segments to count from 0 to 2ⁿ−1. But we can recognize it with a B-RASP program of size O(n). How?

Checking the counter increment

The crucial challenge: how does the B-RASP program verify that each counter value is exactly the previous one plus 1? Let's work through this step by step.

First, we define Boolean vectors C₁,...,C_n where C_k(i) records the k-th bit of the counter at position i (extracted from the input using position-wise operations and attention).

Now, to check that the counter in the segment at position i is the counter in the previous segment at position j plus 1, we use a classic binary increment rule. If the counter value at j is b₁^j...b_n^j, then the counter value at i must be b₁^j...b_n^j + 1. The increment rule says:

Binary increment: find the rightmost 0 bit, flip it to 1, and set all bits to its right to 0. So 0011 → 0100 (bit 3 is the rightmost 0, flip it, clear bits 1-2). This means: for increment position k, bits 1...(k-1) change from 1 to 0, bit k changes from 0 to 1, and bits (k+1)...n stay the same.

How do we express "b^i = b^j + 1" as a Boolean formula over individual bits? We need to find a bit position k such that:

Bit k: b_k^j = 0 and b_k^i = 1 (the bit that flips from 0 to 1)
Bits 1,...,k-1: b_r^j = 1 and b_r^i = 0 for all r < k (the carry bits that flip from 1 to 0)
Bits k+1,...,n: b_r^j = b_r^i for all r > k (higher bits stay the same)

As a single B-RASP attention operation (shown for n = 4, with polarities following the increment rule above):

C₊₁(i) := ▸_j [j < i, Q_#(j)] ⋁_{k=1}^{4} (C_k(i) ∧ ¬C_k(j) ∧ ⋀_{r=1}^{k-1} (¬C_r(i) ∧ C_r(j)) ∧ ⋀_{r=k+1}^{4} (C_r(i) ↔ C_r(j))) : 1

Let me parse this piece by piece:

▸_j [j < i, Q_#(j)]: among positions to the left of i, find the rightmost # separator. This locates the previous counter segment.
The big disjunction ⋁_k: try each possible "flip position" k. For some k, the increment happens at bit k.
C_k(i) ∧ ¬C_k(j): at bit k, the new value has 1 where the old had 0 (bit k flips from 0 to 1).
⋀_{r=1}^{k-1} (¬C_r(i) ∧ C_r(j)): every lower bit has 0 in the new value and 1 in the old (the carry bits flip from 1 to 0).
⋀_{r=k+1}^{4} (C_r(i) ↔ C_r(j)): every higher bit is unchanged.
The default : 1 makes the check pass when there is no earlier # (the first segment).

Here C_k(i) corresponds to bit k at position i, and C_k(j) to bit k at the previously found # position. The value predicate checks that the bits at position i form the number that is the bits at position j incremented by 1: there exists a bit position k where bit k flips from 0 to 1, all lower bits flip from 1 to 0, and all higher bits stay the same.
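A quick way to convince yourself that the bitwise rule (bit k flips 0 to 1, lower bits flip 1 to 0, higher bits unchanged) really captures "plus 1" is to compare it against integer arithmetic. The sketch below does that exhaustively for small n. The list encoding, with index 0 holding bit 1 (the least significant bit), is an assumption of this example.

```python
def increment_predicate(new, old):
    """True iff the bit vector `new` equals `old` plus one (no overflow).

    Both arguments are lists of bools; index 0 is bit 1, the least significant bit.
    """
    n = len(new)
    return any(
        new[k] and not old[k]                               # bit k flips 0 -> 1
        and all(old[r] and not new[r] for r in range(k))    # lower bits flip 1 -> 0
        and all(new[r] == old[r] for r in range(k + 1, n))  # higher bits unchanged
        for k in range(n)
    )


def bits(value, n):
    """n-bit encoding of value, least significant bit first."""
    return [bool((value >> k) & 1) for k in range(n)]


def agrees_with_arithmetic(n):
    """Check the predicate against y == x + 1 for every pair of n-bit values."""
    return all(
        increment_predicate(bits(y, n), bits(x, n)) == (y == x + 1)
        for x in range(2 ** n)
        for y in range(2 ** n)
    )


print(agrees_with_arithmetic(4))  # True
```

The predicate is a disjunction over the n possible flip positions, so its size depends on n but not on how many counter segments the word contains.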

Binary Counter Verification via Attention

Attention verifies a counter increment: the current position attends back to the previous #, and the value predicate checks the increment rule bit by bit.

Checking horizontal constraints

The second condition is simpler. We need to verify that adjacent letters satisfy (a_j, a_i) ∈ H. This is another attention operation:

M_←(i) := ▸_j [j < i, Q_a(j) ∨ Q_b(j) ∨ Q_c(j)] ⋁_(h,h')∈H Q_h(j) ∧ Q_h'(i) : 1

This says: find the rightmost letter to my left. Check that the pair (that letter, my letter) is in H.

The size analysis

The B-RASP program has O(n) Boolean vectors (the C_k's plus some auxiliary ones) and O(1) attention operations (each of constant depth but with score and value predicates that mention O(n) vectors). The total size is O(n).

But the shortest accepted word has length Θ(n · 2ⁿ) — exponential in the program size. This is the core of the succinctness gap: a polynomial-size transformer encodes a language whose shortest word is exponential.

Worked example (n=3, 3 bits). Counter segments: 000a#001b#010c#011b#100a#101b#110c#111b#. That is 8 segments (2³ = 8), each of length 5, total length 40, and every adjacent pair of letters lies in H. The B-RASP program to verify this needs only about 3 Boolean vectors for the bits and a handful of attention operations, so its size is proportional to n = 3. The shortest accepted word is already 40 characters long. With n=10, you need 1024 segments of length 12 and the shortest word has 12,288 characters, verified by a program of size proportional to 10.
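The growth of the shortest word is easy to reproduce. The sketch below builds one word of the counter language for a given n and checks the letter constraint H from earlier in this chapter. The repeating letter pattern "abcb" is an assumption chosen only because every adjacent pair it produces lies in H.

```python
H = {("a", "b"), ("b", "c"), ("b", "a"), ("c", "b")}


def counter_word(n, letters="abcb"):
    """Build a word with 2**n segments: an n-bit counter, one letter, then '#'.

    The letter pattern is a hypothetical choice that keeps adjacent letters in H.
    """
    segments = []
    for value in range(2 ** n):
        letter = letters[value % len(letters)]
        segments.append(format(value, f"0{n}b") + letter + "#")
    return "".join(segments)


def letters_respect_H(word):
    """Check that every pair of consecutive letters in the word lies in H."""
    letters = [ch for ch in word if ch in "abc"]
    return all((x, y) in H for x, y in zip(letters, letters[1:]))


word = counter_word(3)
print(word)                     # 000a#001b#010c#011b#100a#101b#110c#111b#
print(len(word))                # 40
print(letters_respect_H(word))  # True
```

Each segment has n + 2 characters, so the word length is (n + 2) · 2ⁿ, which is the Θ(n · 2ⁿ) growth stated above.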

But wait — can we push this further? From exponential to doubly exponential? Yes. That's the topic of the next chapter.

In the binary counter B-RASP program, what role does the attention operation play in verifying the counter increment?

It looks back to the rightmost # separator (finding the previous counter segment), then uses the value predicate to check bit-by-bit that the current counter value is the previous value plus 1
It counts the total number of 1-bits in the sequence
It stores the counter value in the hidden state for the next position to read

Chapter 5: From Counters to Tilings

The binary counter trick from Chapter 4 gives us a language whose shortest word is singly exponential in the program size. To get the EXPSPACE-completeness result and the doubly exponential succinctness gap, the paper goes one step further: it encodes tiling problems.

What is a tiling problem?

Imagine you have a collection of square tiles, each with colored edges (top, right, bottom, left). You need to tile a grid such that adjacent tiles have matching edge colors. This is a tiling problem.

The specific variant used in the paper is the 2ⁿ-tiling problem:

Given: An integer n (in unary), a finite set T of tiles, and a special "final" tile t_fin
Question: Can you tile a grid with 2ⁿ columns and some number m of rows such that:
1. The tile at position (2ⁿ, m) is t_fin
2. The bottom row has 0 on its down-edges, the top row has 0 on its up-edges
3. The left column has 0 on its left-edges, the right column has 0 on its right-edges
4. Adjacent tiles in the same row have matching horizontal edges: right(i,j) = left(i+1,j)
5. Adjacent tiles in the same column have matching vertical edges: up(i,j) = down(i,j+1)

Theorem (Schwarzentruber, 2019): The 2ⁿ-tiling problem is EXPSPACE-complete. This is a classic result from complexity theory — tiling problems provide a natural way to encode space-bounded computation.

Why EXPSPACE? Because a tiling with 2ⁿ columns can simulate a Turing machine with an exponentially long tape (2ⁿ cells). Each row of the tiling encodes one configuration of the Turing machine. The horizontal constraints ensure each row is a valid configuration, and the vertical constraints ensure each row follows from the previous one by a valid transition.

Encoding a tiling as a string

The paper encodes a tiling configuration as a string over Σ = T ∪ {0, 1, #}:

enc(τ) = ⟨0⟩t_{1,1}# ⟨1⟩t_{2,1}# ... ⟨2ⁿ−1⟩t_{2ⁿ,1}# ⟨0⟩t_{1,2}# ... ⟨2ⁿ−1⟩t_{2ⁿ,m}#

where ⟨i⟩ is the n-bit binary encoding of column index i. Each segment is: n bits (column counter) + tile symbol + separator #. The rows are concatenated one after another.
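As a concrete illustration of the encoding, here is a short sketch that flattens a grid of tiles into the string format just described. Tiles are represented by single-character symbols, which is an assumption of this example; the function name is hypothetical.

```python
def encode_tiling(rows, n):
    """Flatten a tiling into one string, row by row.

    rows: list of rows (row 1 first), each a list of 2**n single-character tile symbols.
    Each segment is the n-bit column index, the tile symbol, and the separator '#'.
    """
    width = 2 ** n
    parts = []
    for row in rows:
        if len(row) != width:
            raise ValueError("each row needs exactly 2**n tiles")
        for column, tile in enumerate(row):
            parts.append(format(column, f"0{n}b") + tile + "#")
    return "".join(parts)


print(encode_tiling([["A", "B"], ["C", "D"]], 1))  # 0A#1B#0C#1D#
```

Because the column counter restarts at 0 in every row, the same counter value reappears once per row, which is what the vertical check later relies on.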

The B-RASP program

The B-RASP program must verify five conditions. The first three (format, boundary conditions) are straightforward. The critical ones are:

Condition 4 (horizontal matching): For adjacent tiles in the same row, right-edges must match left-edges. This is exactly the kind of constraint we checked in Chapter 4 — look back at the previous tile symbol using attention with strict future masking, and verify the constraint.

Condition 5 (vertical matching): For tiles in the same column of adjacent rows, up-edges must match down-edges. This is the clever part: how do you find the tile in the same column of the previous row?

The vertical lookup trick. To find the tile in the same column of the previous row, we use attention with a score predicate that requires matching column counters. The attention operation looks for the rightmost # position to the left where the n-bit counter matches the current position's counter. Because the counter repeats every 2ⁿ segments (once per row), the rightmost match is exactly the same column in the previous row.

Here is the attention operation for the vertical check:

P_vert(i) := ▸_j [j < i, Q_#(j) ∧ ⋀_{k=1}^{n} (C_k(i) ↔ C_k(j))] V(i,j) : 1

The score predicate Q_#(j) ∧ ⋀_{k=1}^{n} (C_k(i) ↔ C_k(j)) says: "j is a # position AND the counter at j matches the counter at i." Among all such positions, we take the rightmost one, which is the same column in the most recent previous row. The value predicate V then checks that the vertical edge constraints are satisfied.
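The vertical lookup can be mimicked on a list of segments. In the sketch below each segment is a (counter, tile) pair, and the up/down edge colours are supplied as dictionaries; these data structures and the tile names are assumptions for illustration, standing in for the Boolean vectors of the real program.

```python
def previous_row_same_column(segments, i):
    """Index of the rightmost j < i whose counter equals the counter at i, or None."""
    for j in range(i - 1, -1, -1):
        if segments[j][0] == segments[i][0]:
            return j
    return None


def vertical_ok(segments, up, down):
    """Check up(previous row) == down(current row) in every column.

    up, down: dicts from tile symbol to edge colour (hypothetical tile data).
    Segments in the first row have no match and pass, mirroring the ': 1' default.
    """
    for i, (_, tile) in enumerate(segments):
        j = previous_row_same_column(segments, i)
        if j is not None and up[segments[j][1]] != down[tile]:
            return False
    return True


# Two rows, two columns: counters "0" and "1" repeat once per row.
segments = [("0", "A"), ("1", "B"), ("0", "C"), ("1", "D")]
up = {"A": 1, "B": 2, "C": 0, "D": 0}
down = {"A": 0, "B": 0, "C": 1, "D": 2}
print(previous_row_same_column(segments, 3))  # 1
print(vertical_ok(segments, up, down))        # True
```

The backwards scan plays the role of rightmost hard attention under strict future masking: it stops at the first earlier segment whose counter matches.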

Tiling Verification via Attention

A 2D tiling encoded as a 1D string. A position verifies its horizontal constraint by attending to the previous tile, and its vertical constraint by attending to the matching counter in the previous row.

The size analysis

The B-RASP program has size O(n), polynomial in the input parameter n. But the tiling has 2ⁿ columns and potentially 2^{2^n} rows (because the Turing machine being simulated can run for doubly exponentially many steps). So the shortest accepted string can have length 2^{2^{Ω(n)}}, doubly exponential in the program size.

This is the key lemma:

Lemma 8 (from the paper): Given a 2ⁿ-tiling problem instance, one can construct in time polynomial in n a B-RASP program whose language is non-empty if and only if the tiling problem has a solution. Moreover, the B-RASP program has a special form (Lemma 9) that allows polynomial-time translation to a UHAT.

Combined with Proposition 7 (the 2ⁿ-tiling problem is EXPSPACE-complete), this gives us:

EXPSPACE-complete problem

2ⁿ-tiling: does a valid tiling exist?

↓ poly-time reduction (Lemma 8)

B-RASP non-emptiness

Is L(P) ≠ ∅ for a poly-size B-RASP program P?

↓ poly-time translation (Lemma 9)

UHAT non-emptiness

Is L(T) ≠ ∅ for a poly-size UHAT T?

This proves the lower bound: UHAT non-emptiness is EXPSPACE-hard.

How does the B-RASP program find the tile in the same column of the previous row to check vertical constraints?

It stores the previous row in memory
It counts the number of # symbols seen so far
It uses attention with a score predicate that requires matching n-bit column counters; the rightmost match to the left is the same column in the previous row

Chapter 6: The Succinctness Hierarchy

We now have all the ingredients to state and prove the paper's main theorems. This chapter derives the succinctness gaps between UHATs, LTL, RNNs, and finite automata.

Theorem 15: UHATs are exponentially more succinct than LTL

The proof constructs a family of languages {L_n}_{n≥1} with the following properties:

L_n is recognized by a UHAT T_n of size polynomial in n
The shortest word in L_n has length at least 2^{2^n}
Any LTL formula φ_n recognizing L_n has size at least exponential in n

Where does the family come from? From the tiling construction. Take a Turing machine M_n that implements a 2ⁿ-bit binary counter (counting from the all-zeros string 0^{2^n} to the all-ones string 1^{2^n}). This machine uses linearly many states but an exponentially long tape, and its computation takes at least 2^{2^n} steps.

Reduce M_n to a tiling problem instance I_n of size polynomial in n (using the standard Turing-machine-to-tiling reduction from van Emde Boas, 1997). By Lemmas 8 and 9, construct a UHAT T_n of size polynomial in n recognizing valid tiling encodings. The shortest accepted word has length at least 2^{2^n}.

Now the key argument for the lower bound on LTL. There is a classical result about LTL formulas:

LTL shortest-word bound: If an LTL formula φ of size s accepts any word at all, then it accepts some word of length at most 2^{O(s)}. (This follows from the exponential-time conversion from LTL to finite automata, via Vardi & Wolper, 1994.)

Let's do the math. Suppose φ_n is an LTL formula of size s that recognizes L_n. Then φ_n accepts some word of length at most 2^{O(s)}. But the shortest word in L_n has length at least 2^{2^n}. So:

2^{O(s)} ≥ 2^{2^n} ⇒ O(s) ≥ 2ⁿ ⇒ s = Ω(2ⁿ)

The LTL formula must have exponential size. Meanwhile, the UHAT has polynomial size. The gap is exponential.

Succinctness Gap Growth

Representation sizes diverge as n grows. The y-axis is log-scale.

Is the gap tight?

Yes. The paper also proves the matching upper bound:

Proposition 16: Given any LTL formula φ, one can construct in polynomial time a UHAT recognizing the same language.

The proof is by induction on the structure of φ. The key case is the "Since" operator φ₁ S φ₂: assuming we've already computed the truth values of φ₁ and φ₂ at every position, we use an attention layer with strict future masking and rightmost tie-breaking to find the most recent position where ¬φ₁ ∨ φ₂ holds, then check whether φ₂ holds there.
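The "Since" case is worth checking by hand, because it is not obvious that a single rightmost lookup suffices. The sketch below compares the direct strict-past semantics of φ₁ S φ₂ with the attention-style lookup, over all Boolean truth-value vectors up to a given length. The function names are hypothetical, and the default of False when no position qualifies is an assumption that matches the semantics (no witness for φ₂ exists).

```python
from itertools import product


def since_direct(phi1, phi2):
    """Strict semantics: some j < i has phi2[j], and phi1[k] holds for all j < k < i."""
    n = len(phi1)
    return [
        any(phi2[j] and all(phi1[k] for k in range(j + 1, i)) for j in range(i))
        for i in range(n)
    ]


def since_via_attention(phi1, phi2):
    """Attend to the rightmost j < i where (not phi1[j]) or phi2[j]; output phi2[j]."""
    n = len(phi1)
    out = []
    for i in range(n):
        hits = [j for j in range(i) if (not phi1[j]) or phi2[j]]
        out.append(phi2[hits[-1]] if hits else False)
    return out


def agree_up_to(max_len):
    """Compare both definitions on every pair of Boolean vectors up to max_len."""
    for n in range(max_len + 1):
        for phi1 in product([False, True], repeat=n):
            for phi2 in product([False, True], repeat=n):
                if since_direct(phi1, phi2) != since_via_attention(phi1, phi2):
                    return False
    return True


print(agree_up_to(5))  # True
```

The lookup works because every position strictly between the attended position and i satisfies φ₁ and not φ₂, so the attended position is the only one that can decide the outcome.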

This means: LTL → UHAT takes polynomial time, but UHAT → LTL can force an exponential blowup in size (Proposition 13). The gap is exactly one exponential.

Theorem 17: UHATs are doubly exponentially more succinct than finite automata

Take the same UHAT T_n from above. It has polynomial size and its shortest accepted word has length at least 2^{2^n}. Now use the DFA shortest-word property:

DFA shortest-word bound: If a DFA with s states accepts any word, it accepts some word of length at most s. (This is immediate from the pumping lemma — in an accepting run longer than s, some state must repeat, so we can pump down to a shorter accepting word.)

So any DFA recognizing L_n needs at least 2^{2^n} states. The UHAT has polynomial size. The gap is doubly exponential.

Corollary 18: UHATs are exponentially more succinct than RNNs

This follows by combining Theorem 17 with Proposition 3: any RNN with d-dimensional state and k-bit precision can be simulated by a DFA with 2^{kd} states. Since any DFA for L_n needs at least 2^{2^n} states, we get 2^{kd} ≥ 2^{2^n}, that is, kd ≥ 2ⁿ. So the RNN description size (which includes the precision k in unary) is at least logarithmic in the DFA size, meaning the RNN size is at least 2ⁿ when the DFA size is 2^{2^n}.

Since RNNs recognize strictly more languages (all regular, not just star-free), this corollary must be interpreted carefully: for the star-free languages that both can recognize, the transformer description can be exponentially smaller.

The complete picture

From \ To	UHAT	LTL	RNN	DFA
UHAT	—	exp (Prop. 13)	exp	2-exp
LTL	poly (Prop. 16)	—	poly	exp
RNN	—	—	—	exp (Prop. 3)
DFA	exp	exp	poly	—

Entry (row R, column C) shows the worst-case blowup when translating from representation R to representation C. "exp" means exponential, "2-exp" means doubly exponential, "poly" means polynomial.

Why is the succinctness gap between UHATs and DFAs doubly exponential, while the gap between UHATs and LTL is only singly exponential?

Because DFAs must accept some word of length at most equal to their state count (so the shortest word 2^2^n forces 2^2^n states), while LTL formulas can accept words up to exponential in their size (so the shortest word 2^2^n only forces 2^n formula size)
Because DFAs are inherently weaker than LTL
Because the translation from UHAT to DFA goes through two exponential steps

Chapter 7: Verification Intractability

Succinctness is not just an abstract property — it has direct computational consequences. The more succinct a representation, the harder it is to analyze. This chapter derives the paper's complexity results.

The non-emptiness problem

The most basic question you can ask about a language recognizer: does it accept any word at all? This is the non-emptiness problem.

Representation	Non-emptiness complexity
DFA	P (polynomial time — just check reachability of an accepting state)
LTL	PSPACE-complete (Sistla & Clarke, 1985)
UHAT	EXPSPACE-complete (this paper, Theorem 5)

Each jump in succinctness corresponds to a jump in the complexity of the non-emptiness problem. This is not a coincidence — it is a theorem.

The lower bound: EXPSPACE-hardness

We already built the machinery for this in Chapter 5. The 2ⁿ-tiling problem is EXPSPACE-complete (Proposition 7). We showed how to reduce it to UHAT non-emptiness in polynomial time (Lemmas 8 and 9). Therefore, UHAT non-emptiness is EXPSPACE-hard.

Let's trace through what this means computationally. EXPSPACE-completeness means the problem can be solved in exponential space and, by the space hierarchy theorem, cannot be solved in polynomial space. Time-wise, a machine using 2^O(n) space can run for up to 2^{2^O(n)} steps, so the time guarantee that comes with an EXPSPACE algorithm is only doubly exponential. For comparison:

DFA non-emptiness (P): Can be solved in O(n) time with a graph reachability check
LTL non-emptiness (PSPACE): Known algorithms take up to 2^O(n) time, singly exponential
UHAT non-emptiness (EXPSPACE): Known algorithms take up to 2^{2^O(n)} time, doubly exponential
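For contrast with the harder cases, here is what the easy end of that list looks like. This is a minimal sketch of DFA non-emptiness as breadth-first reachability, assuming the DFA is given as a transition dictionary keyed by (state, symbol); the function name is ours.

```python
from collections import defaultdict, deque


def dfa_is_nonempty(delta, start, accepting):
    """Return True if some accepting state is reachable from the start state."""
    # Forget the symbols: only the state graph matters for reachability.
    successors = defaultdict(set)
    for (src, _symbol), dst in delta.items():
        successors[src].add(dst)

    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state in accepting:
            return True
        for nxt in successors[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Every state and transition is touched a bounded number of times, which is where the linear running time comes from. Nothing comparable is available for UHATs, because the "state graph" is only implicit in the attention layers.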

Complexity of Verification

Time to verify non-emptiness for a representation of size n. Note the log-log scale.

The upper bound: in EXPSPACE

The matching upper bound comes from two key results:

Proposition 12: The values occurring during a UHAT computation can be represented with polynomially many bits. This is a technical but crucial observation. Even though attention layers compose multiplicatively, the score function results are not forwarded to the next layer — only the selected vectors are. This prevents exponential bit-growth.

Why does this matter? Because it enables:

Proposition 13: Any UHAT of size n can be converted in exponential time to an equivalent LTL formula of exponential size.

The conversion works as follows. Because all intermediate values use polynomially many bits (Proposition 12), we can precompute all possible affine transformation results and score comparisons during the construction of the LTL formula. The LTL formula only needs to simulate the position-wise behavior — which positions attend to which — not the numerical computations.

Once we have an exponential-size LTL formula, we can check non-emptiness in PSPACE in the size of the formula (Sistla & Clarke, 1985). PSPACE in the exponential-size formula means EXPSPACE in the original UHAT size.

UHAT (size n)

Original transformer

↓ Prop. 13: exp-time translation

LTL formula (size 2^O(n))

Equivalent temporal logic formula

↓ Sistla-Clarke: PSPACE in formula size

Non-emptiness answer

Using space 2^O(n) = EXPSPACE in UHAT size

What about restricted UHATs?

The paper also examines what happens with restricted masking and tie-breaking patterns:

Corollary 14: If every attention layer uses strict future masking with leftmost tie-breaking (or strict past masking with rightmost tie-breaking), then the UHAT can be translated to an LTL formula using only the P operator (or only F). The non-emptiness problem for these restricted UHATs drops from EXPSPACE to NEXP. The mix of masking directions is what makes verification maximally hard.

The upshot: the full EXPSPACE-hardness of verification requires transformers that use both forward-looking and backward-looking attention. If every layer looks in one direction with the matching tie-breaking rule, verification becomes easier, though still intractable. A causal GPT-style model fixes the masking direction, but note that the corollary also constrains the tie-breaking.

The equivalence problem

The paper extends the intractability result to other verification problems:

Theorem 19: Equivalence of UHATs (do two transformers recognize the same language?) is EXPSPACE-complete.

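The lower bound reduces from non-emptiness: a UHAT T recognizes the empty language iff T is equivalent to a fixed "always reject" UHAT. The upper bound converts both UHATs to LTL formulas of exponential size (Proposition 13) and checks their equivalence in space polynomial in the formula size, which is EXPSPACE in the size of the UHATs.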
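The shape of the lower-bound reduction is easiest to see on a simpler model. The sketch below plays it out for complete DFAs rather than UHATs (an illustrative stand-in, not the paper's construction): equivalence is decided by searching the product automaton for a pair of states that disagree on acceptance, and emptiness is then just equivalence with an always-reject machine. All names are hypothetical.

```python
from collections import deque


def dfas_equivalent(a, b, alphabet):
    """a and b are (delta, start, accepting) triples for complete DFAs."""
    (delta_a, start_a, acc_a), (delta_b, start_b, acc_b) = a, b
    seen = {(start_a, start_b)}
    queue = deque([(start_a, start_b)])
    while queue:
        p, q = queue.popleft()
        if (p in acc_a) != (q in acc_b):
            return False  # some word is accepted by exactly one of the two
        for symbol in alphabet:
            nxt = (delta_a[(p, symbol)], delta_b[(q, symbol)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def always_reject(alphabet):
    """A one-state DFA with no accepting state."""
    return ({(0, symbol): 0 for symbol in alphabet}, 0, set())


def is_empty(dfa, alphabet):
    """Emptiness expressed as an equivalence query, as in the reduction."""
    return dfas_equivalent(dfa, always_reject(alphabet), alphabet)
```

Any procedure that decides equivalence therefore decides emptiness with one extra fixed input, so equivalence can be no easier than non-emptiness.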

Similarly, universality (does the UHAT accept all words?) is EXPSPACE-complete.

Why is the UHAT-to-LTL translation size exponential rather than doubly exponential, given that the UHAT can encode doubly exponentially long words?

Because LTL formulas are exponentially more succinct than UHATs
Because the UHAT only has polynomially many layers
Because the intermediate values in a UHAT use only polynomially many bits (Prop. 12), so the LTL formula only needs to enumerate all possible score comparisons (exponentially many) rather than all possible counter values (doubly exponentially many)

Chapter 8: Implications — What This Means

Let's step back from the formal machinery and ask: what does this paper actually tell us about transformers in practice?

Implication 1: Expressivity is the wrong question

The theoretical community spent years debating whether transformers are "as expressive as" RNNs. The answer is no — RNNs can recognize (aa)* and transformers cannot. But this paper reframes the debate entirely. For the languages transformers can recognize, they do so vastly more efficiently than RNNs. A transformer of size n encodes patterns that an RNN of size n cannot touch.

Think of it as vocabulary vs. eloquence. English has more words than French (vocabulary/expressivity). But a skilled French poet can say in 14 lines what a clumsy English writer needs 14 pages for (eloquence/succinctness). Succinctness measures eloquence.

Implication 2: Transformers are inherently hard to verify

The EXPSPACE-completeness result is a negative result for AI safety and interpretability. It says that checking even the most basic properties of a transformer ("does it ever produce output X?") is provably intractable in the worst case. No algorithm can solve it in polynomial space, and the algorithm from Chapter 7 takes doubly exponential time in the worst case.

But wait — doesn't this only apply to the worst case? Yes. The EXPSPACE-hardness proof requires transformers that encode large binary counters. Real LLMs don't count in binary. So practical transformers might be easier to verify than the worst case suggests. The paper explicitly highlights this as an open question: are there natural subclasses of transformers (ones that can't encode counters) with lower verification complexity?

Implication 3: Attention is the key advantage

The succinctness gap comes specifically from attention's ability to do content-addressed lookup. An RNN must carry everything it needs in its fixed-size hidden state. A transformer can store information in the input sequence itself and retrieve it later using attention. This is not just a practical convenience — it is a provable computational advantage.

The Architecture Landscape: Expressivity vs. Succinctness

Each architecture occupies a different point in the expressivity-succinctness space. More expressive isn't always better — succinctness matters too.

Implication 4: State-Space Models inherit the RNN limitation

The paper explicitly notes that their RNN results apply to state-space models (SSMs) like Mamba (Gu & Dao, 2023). Since SSMs with fixed precision are also equivalent to finite automata (Merrill et al., 2024), they suffer the same exponential succinctness penalty relative to transformers. This is a theoretical argument for attention-based architectures even if SSMs are more efficient at inference.

Implication 5: The Roman numeral analogy

The paper draws a fascinating parallel to number systems. Zipf's law of abbreviation says that frequently occurring concepts tend to have succinct descriptions. The Hindu-Arabic numeral system is exponentially more succinct than Roman numerals for expressing large numbers. This succinctness, the paper argues, "potentially enables mathematics and computer science as we see today." Similarly, the succinctness of transformers may be part of why they have been so successful in practice.

What the paper does NOT say

It is important to be clear about the limitations:

Not about training. The paper says nothing about whether transformers can learn to exploit their succinctness advantage. A small transformer exists that recognizes the counting language, but SGD may not find it.
Not about softmax. The results are for unique hard attention. They transfer to fixed-precision softmax (via Jerad et al., 2025), but the quantitative bounds may differ for average-hard attention or non-fixed precision.
Not about specific real-world tasks. The succinctness advantage is demonstrated on formal language recognition tasks (counting, tiling). Whether similar gaps exist for natural language is an open question.
Not about inference efficiency. A more succinct model isn't necessarily faster to run. On an input of length n, transformers have O(n²) attention cost; RNNs have O(n) sequential cost. Succinctness measures description size, not runtime.

Why does the EXPSPACE-hardness of UHAT verification not necessarily mean that verifying real-world LLMs is intractable?

The hardness proof relies on transformers that encode large binary counters (a construction that real LLMs don't use), so practical transformers may belong to a subclass with lower verification complexity
Real LLMs use softmax, not hard attention
Real LLMs have too many parameters to be verified

Chapter 9: Connections

This chapter places the paper in context and links to related work and open questions.

Related results

Paper	Result	Relation to this work
Yang et al. (2024)	UHATs = star-free languages; B-RASP ≡ UHAT	Foundation — this paper extends with succinctness
Jerad et al. (2025)	Bounds on UHATs transfer to softmax transformers	Makes UHAT results practically relevant
Sälzer et al. (2025)	Fixed-precision transformers are NEXP-hard to verify	This paper improves to EXPSPACE-complete (tighter)
Merrill et al. (2024)	SSMs with fixed precision = regular languages	SSMs are exponentially less succinct than transformers
Li & Cotterell (2025)	Characterizing softmax transformer expressivity	Open: succinctness of softmax transformers
Sistla & Clarke (1985)	LTL non-emptiness is PSPACE-complete	Used for the EXPSPACE upper bound

Open questions raised by the paper

Learnability of succinct transformers: A small transformer exists for the counting language, but can gradient descent find it? Empirical results on length generalization (Garg et al., 2022; Naim et al., 2025) are mixed.
Succinctness of softmax transformers: The paper uses hard attention. What about fixed-precision softmax? Average-hard attention? Some of these variants can recognize languages that UHATs cannot, but whether they are more succinct on the shared languages is open.
Subclasses with tractable verification: Since the EXPSPACE-hardness requires counter-encoding transformers, are there natural restricted classes (no counter-encoding) with PSPACE or NP verification?
Succinctness without positional masking: The paper uses positional masking as a simple form of positional encoding. Other PE types (sinusoidal, RoPE, ALiBi) may have different succinctness properties.
Beyond regular languages: This paper works within the regular/star-free hierarchy. What about context-free or context-sensitive languages with unbounded precision?

Cheat sheet: key results at a glance

Symbol	Meaning
Σ	Alphabet (finite set of tokens)
UHAT	Unique Hard-Attention Transformer (fixed-precision)
B-RASP	Boolean RASP — programming language equivalent to UHAT
LTL	Linear Temporal Logic — logic equivalent to star-free languages
|R|	Size of representation R (binary encoding length)
2ⁿ-tiling	Tiling a grid with 2ⁿ columns (EXPSPACE-complete)

Theorem	Statement
Thm 5	UHAT/B-RASP non-emptiness is EXPSPACE-complete
Thm 15	UHATs are exponentially more succinct than LTL
Thm 17	UHATs are doubly exponentially more succinct than DFA
Cor 18	UHATs are exponentially more succinct than RNN
Thm 19	UHAT equivalence is EXPSPACE-complete
Prop 13	UHAT → LTL in exponential time (improved from doubly exponential)
Prop 16	LTL → UHAT in polynomial time

The takeaway. Transformers are not the most expressive sequence model — RNNs cover a strictly larger language class. But transformers are the most eloquent: they can say in polynomial space what others need exponential or doubly exponential space to express. This succinctness is fundamental to the architecture, powered specifically by attention's ability to do content-addressed lookup. And it comes at a cost: the very compactness that makes transformers powerful also makes them provably hard to analyze.

What is the main open question the paper raises about practical transformers?

Whether transformers are Turing-complete
Whether gradient descent can actually learn the succinct representations that provably exist, i.e., whether the succinctness advantage is achievable in practice through training
Whether SSMs can be made as succinct as transformers