Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Mehrdad Farajtabar, Minsik Cho (Apple) — arXiv 2026

TIDE: Every Layer Knows the Token Beneath the Context

Standard transformers look up the token embedding once and throw it away. Every subsequent layer flies blind — it processes context but never re-consults which token it is working on. TIDE fixes this by giving every layer persistent, context-free access to token identity.

Prerequisites: Transformer basics (attention, FFN, residual stream) + Embedding lookup intuition + Basic calculus (gradients). That's it.
10 chapters · 10+ simulations · 0 assumed knowledge

Chapter 0: The Forgotten Token

Picture yourself in a conversation. Someone says "I saw a crane by the river." You hear the word "crane" once, form a mental image — maybe a bird, maybe construction equipment — and then the word dissolves. For the rest of the conversation, you reason about this crane entirely through context: "it flew away," "it lifted the beam." But you never go back and re-read the original word.

This is exactly how a standard transformer processes language. It looks up each token in an embedding table exactly once — at Layer 0 — and then permanently discards the token index. Every subsequent layer operates on a contextualized hidden state: a blend of information from all surrounding tokens, mixed through attention and feed-forward networks. The original token identity? Gone.

For most tokens, this works fine. "The" and "cat" appear so often that their embeddings are well-trained, and context is usually sufficient to distinguish them. But what about the word "acetaminophen"? Or the number "1847"? Or the rare proper noun "Kvothe"?

The one-shot injection assumption

Let's trace what happens inside a standard transformer. The input sequence "Jack loves his coffee with sugar" enters the model as token indices: [4521, 8923, 1104, 7834, 1205, 9122]. Each index is looked up in an embedding table E to produce a vector:

h^(0) = E[token_index] ∈ ℝ^d

This is the only moment the model has direct access to which token it is processing. After this, Layer 1 computes attention over all positions, mixes information, and produces a new hidden state. Layer 2 does the same. By Layer 16 of a deep transformer, the hidden state at position 4 (originally "coffee") is a complex mixture of information from every token in the sequence.

The token index 7834 ("coffee") was consulted once and never again. Compare this to position, which is re-injected at every single attention layer via RoPE (Rotary Position Embeddings). Position gets persistent identity. Token identity gets a single shot.

The asymmetry: In a standard transformer, position is injected at every layer (via RoPE). Token identity is injected once (at Layer 0) and then abandoned forever. TIDE's core insight is that token identity deserves the same persistent treatment as position.

Watch the token fade

The canvas below simulates a simplified transformer. Watch how the "token identity signal" — the model's ability to distinguish which token is at each position purely from the hidden state — fades as you go deeper. By the middle layers, tokens with similar contexts become nearly indistinguishable.

Token Identity Fading Across Layers

Each column is a token. Color opacity represents how much the hidden state retains the original token's unique identity signal. Watch it fade.

This fading is not a bug — it is a feature for common tokens. Contextual mixing is what makes transformers powerful. But for rare tokens whose embeddings are poorly trained, and for distinct tokens that happen to share syntactic context, the loss of token identity is catastrophic.

The 2026 paper by Jaiswal et al. at Apple identifies two specific failure modes caused by this single-injection design, proposes an elegant architectural fix called TIDE (Token Identity Delivered Everywhere), and proves theoretically why it works. Let's build it from scratch.

In a standard transformer, how many times is the token index consulted during a forward pass through L layers?

Chapter 1: The Rare Token Problem

Open any book, any Wikipedia article, any codebase. Count how often each word appears. You will discover one of the most universal laws in language: Zipf's Law. The most frequent word ("the") appears roughly twice as often as the second most frequent ("of"), three times as often as the third, and so on. A tiny fraction of the vocabulary dominates the corpus.

For a modern LLM tokenizer like LLaMA-3 with |V| = 128,256 tokens, this means the top 1% of tokens account for roughly 80% of all occurrences. The bottom 10%? They might appear once every few billion tokens.

Gradient starvation

Here is the problem. Under minibatch SGD, a token embedding e_v only receives a gradient update when token v appears in the current batch. If token v is rare (say it shows up once every million batches), its embedding gets updated once every million steps while "the" gets updated every single step.

Let's make this precise. With batch size B, sequence length T, and per-token squared gradient norm bounded by G^2, the expected cumulative gradient signal after τ training steps is:

E[ ∑_{s=1}^{τ} ||∇_{e_v} L_s||^2 ] ≤ τ · f_v · B · T · G^2

Where f_v is the unigram probability of token v, that is, how often it appears in the corpus. This is the key: the bound on gradient signal scales linearly with frequency.

A worked example: six orders of magnitude

Let's plug in real numbers from the paper. Training LLaMA-1B on 200 billion tokens with B = 8, T = 2048:

| Token tier | Frequency bin | f_v | Expected gradient updates |
|---|---|---|---|
| Hapax (rarest) | Bin 0 | 8.3 × 10^-9 | ~1,660 |
| Near-hapax | Bin 1 | 3.3 × 10^-8 | ~6,640 |
| Uncommon | Bin 2 | 8.3 × 10^-8 | ~16,600 |
| Mid-frequency | Bins 3-6 | ~10^-6 | ~10^5 to 10^6 |
| Common (highest) | Bins 7-9 | 8.3 × 10^-3 | ~1.66 × 10^9 |

The rarest tokens get ~1,660 gradient updates over the entire training run. The most common tokens get 1.66 billion. That is a disparity of 10^6, a million-fold difference in learning signal.

The ratio: For rare token v (f_v = ε) and common token u (f_u = c), the ratio of cumulative gradient signals is O(ε/c). In the paper's instantiation, ε/c ≈ 10^-6. The rare token's embedding is essentially untrained noise compared to the common token's well-sculpted vector.
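The arithmetic behind the table can be checked in a few lines. This is a minimal sketch that assumes one embedding update per occurrence of a token, so the expected count is f_v times the total token budget. The tier names and the helper `expected_updates` are illustrative and do not come from the paper.

```python
# Expected embedding updates per frequency tier (illustrative helper names).
TOTAL_TOKENS = 200e9   # training budget quoted in the text
B, T = 8, 2048         # batch size and sequence length quoted in the text

# Unigram probabilities f_v copied from the table above.
TIERS = {
    "hapax": 8.3e-9,
    "near-hapax": 3.3e-8,
    "uncommon": 8.3e-8,
    "common": 8.3e-3,
}

def expected_updates(f_v, total_tokens=TOTAL_TOKENS):
    """Assumption: one update per occurrence, so updates = f_v * total tokens."""
    return f_v * total_tokens

steps = TOTAL_TOKENS / (B * T)  # tau, the number of optimizer steps
for name, f_v in TIERS.items():
    # tau * f_v * B * T is the same quantity written in per-step form.
    per_step_form = steps * f_v * B * T
    print(f"{name:>10}: {expected_updates(f_v):.3g} updates ({per_step_form:.3g})")

ratio = expected_updates(TIERS["common"]) / expected_updates(TIERS["hapax"])
print(f"common / hapax ratio: {ratio:.3g}")
```

Writing the count as τ · f_v · B · T makes the link to the bound explicit: with G^2 factored out, the bound is just the expected number of occurrences.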

Empirical evidence: norm tells the story

How can we see this in a real model? The paper examines the L2 norm of embeddings in a trained LLaMA-Base-1B. Well-trained embeddings converge to high-norm, well-structured vectors. Under-trained embeddings stay low-norm and noisy — close to their random initialization.

The result is stark: mean embedding norm increases monotonically from 0.798 (rarest bin) to 1.549 (most common bin). The rarest tokens have embeddings with roughly 0.52× the norm of common tokens (0.798 / 1.549). Worse, the rare token norm distribution is wide and diffuse (noise-dominated), while common tokens cluster in a narrow peak (well-converged).

And this gap doesn't close with more training. Checkpoints at different points during training show rare token norms actually declining over time while common token norms keep growing. Gradient starvation is not a cold-start artifact — it is permanent.

Gradient Starvation Across Frequency Bins

Each bar shows expected gradient updates for a token frequency bin over 200B tokens of training. Note the logarithmic scale — the disparity is enormous.


Why can't we just train longer?

You might think: "If rare tokens need more gradient updates, just train on more data." But the paper shows this makes things worse, not better. The norm growth rate per 50B additional tokens is negative for rare tokens and positive for common ones. More training actively widens the gap because the common tokens keep pulling the loss landscape in their direction while rare tokens are noise-dominated and get further marginalized.

The fundamental issue: The problem isn't that we don't train long enough. The problem is architectural. A single embedding table with frequency-proportional gradient flow is structurally incapable of serving all tokens equally. We need more gradient pathways.
A token appears in 1 out of every 10,000 training batches. Another token appears in every batch. Over 1 million training steps, approximately how many times more gradient signal does the common token receive?

Chapter 2: Contextual Collapse

Gradient starvation is about training — rare tokens never learn good embeddings. But there is a second, deeper failure mode that afflicts even well-trained tokens when they share syntactic context. The paper calls it Contextual Collapse.

Consider these two sentences:

"The treaty was signed in 1847 after years of negotiation."
"The treaty was signed in 1849 after years of negotiation."

The tokens "1847" and "1849" are semantically distinct — they refer to different years, different events. But their syntactic context is identical: both appear after "signed in" and before "after years of." Attention operates on context, so it produces nearly identical outputs for both positions. The hidden states converge.

Three categories of collapse

The paper identifies three canonical categories where contextual collapse occurs:

| Category | Example pairs | Why they collapse |
|---|---|---|
| Grammatical homophones | their/there, it's/its, who/whom | Identical syntactic slot, different semantics |
| Numeric identity tokens | 1847/1849, 100/1000 | Numbers in identical templates are context-identical |
| Rare domain tokens | ibuprofen/acetaminophen | Rare + similar context = doubly indistinguishable |

The Lipschitz bound: why the FFN can't help

You might think: "The feed-forward network after attention should be able to separate these tokens — it has millions of parameters." The paper proves this is mathematically impossible when the inputs are close enough.

An FFN is a continuous function with bounded Lipschitz constant L_FFN. This means:

||FFN(h_u) - FFN(h_v)|| ≤ L_FFN · ||h_u - h_v||

If two hidden states are close (||h_u - h_v|| ≤ δ), the FFN outputs must also be close, no matter what the weights are. It's a mathematical ceiling, not a practical one.

Now suppose the model needs to produce different outputs for tokens u and v, for example different next-token predictions. Let g be the target function with ||g(u) - g(v)|| = C > 0. The paper proves:

max(||FFN(h_u) - g(u)||, ||FFN(h_v) - g(v)||) ≥ (C - L_FFN · δ) / 2

When C > L_FFN · δ, that is, when the target separation exceeds what the Lipschitz bound allows, this error is strictly positive. The FFN cannot approximate the target function on both tokens simultaneously, regardless of how many parameters it has or how wide the network is.

The structural impossibility: The FFN is a continuous function mapping from continuous inputs. If two inputs are close, the outputs must be close. Adding more parameters or making the FFN wider does not help: it can change L_FFN, but making L_FFN large destabilizes all other tokens. This is not a capacity problem; it is a representational bottleneck.

Proof walkthrough

Let's walk through the proof step by step. We have collapsed tokens (u, v) with ||h_u - h_v|| ≤ δ.

Step 1: Lipschitz bound
Since FFN is Lipschitz: ||FFN(h_u) - FFN(h_v)|| ≤ L_FFN · δ
Step 2: Triangle inequality
C = ||g(u) - g(v)|| ≤ ||g(u) - FFN(h_u)|| + ||FFN(h_u) - FFN(h_v)|| + ||FFN(h_v) - g(v)||
Step 3: Substitute the Lipschitz bound
C ≤ ||g(u) - FFN(h_u)|| + L_FFN · δ + ||FFN(h_v) - g(v)||
Step 4: Rearrange
||g(u) - FFN(h_u)|| + ||FFN(h_v) - g(v)|| ≥ C - L_FFN · δ
Step 5: Max ≥ half the sum
max(err_u, err_v) ≥ (C - L_FFN · δ) / 2 > 0 when C > L_FFN · δ

This is devastating. It says: at least one of the two tokens must be approximated incorrectly, and the error has a strict lower bound that no weight configuration can reduce to zero.
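A one-dimensional toy makes the bound tangible. The sketch below is an illustration, not the paper's experiment: it assumes scalar hidden states h_u = 0 and h_v = δ, targets g(u) = 0 and g(v) = C, and stands in for the FFN with an affine map f(h) = w·h + b whose slope satisfies |w| ≤ L. The function names are hypothetical. A grid search over (w, b) never gets below the floor (C - L·δ)/2.

```python
def lipschitz_error_floor(C, L, delta):
    """Lower bound on the worst-case error: max(0, (C - L * delta) / 2)."""
    return max(0.0, (C - L * delta) / 2)

def best_affine_fit_error(C, L, delta, grid=201):
    """Smallest worst-case error over a grid of L-Lipschitz affine maps.

    Toy setup (an assumption of this sketch): h_u = 0, h_v = delta,
    targets g(u) = 0 and g(v) = C, and f(h) = w * h + b with |w| <= L.
    """
    best = float("inf")
    for i in range(grid):
        w = -L + 2 * L * i / (grid - 1)       # slope in [-L, L]
        for j in range(grid):
            b = -C + 2 * C * j / (grid - 1)   # bias in [-C, C]
            err_u = abs(b - 0.0)              # |f(h_u) - g(u)|
            err_v = abs(w * delta + b - C)    # |f(h_v) - g(v)|
            best = min(best, max(err_u, err_v))
    return best

C, L, delta = 1.0, 10.0, 0.01
floor = lipschitz_error_floor(C, L, delta)
best = best_affine_fit_error(C, L, delta)
print(f"theoretical floor: {floor:.3f}, best grid fit: {best:.3f}")
```

Raising L until L·δ ≥ C drives the floor to zero, which is the "make the FFN steeper" escape route the text argues is destabilizing.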

Contextual Collapse: Hidden State Distance Across Layers

Simulated L2 distance between hidden states of token pairs across transformer layers. Select a category to see collapse patterns from the paper's Figure 2.


The heatmap shows what the paper found empirically in LLaMA-Base-1B: for all three categories, the L2 distance between hidden states stays near zero through most of the network. The tokens are indistinguishable to the FFN — it cannot tell "1847" from "1849" no matter how hard it tries. Only at the very last few layers does any separation emerge, and for numeric tokens, even the final layer shows significant collapse.

The key contrast with gradient starvation: Gradient starvation is about training signal — rare tokens don't learn good embeddings. Contextual collapse is about inference-time representation — even well-trained tokens become indistinguishable when context is shared. They are two sides of the same coin: the single-injection assumption.
Why can't making the FFN wider (more parameters) solve contextual collapse?

Chapter 3: The TIDE Architecture

We now understand the two failures: gradient starvation starves rare token embeddings, and contextual collapse makes the FFN blind to tokens with similar context. Both stem from the same root cause — the token index is consulted once and discarded.

TIDE's fix is elegantly simple: give every layer persistent access to token identity. Not through the contextualized hidden state, but through a parallel pathway that indexes the token directly, bypassing attention and the FFN entirely.

The three components

TIDE adds three components to the standard transformer:

1. EmbeddingMemory
K independent embedding tables (MemoryBlocks), each mapping token indices to learned vectors. Computed once per forward pass, shared across all L layers.
2. Depth-Conditioned Router
Each layer has a lightweight linear router that generates softmax weights over the K memory blocks, deciding how to combine them at this specific depth.
3. Null Bank
A (K+1)-th slot in the router that always outputs the zero vector. This is a learned "off switch" — when the router assigns all weight to the null bank, TIDE recovers the standard transformer.

How a TIDE layer works

A standard transformer layer computes:

h̃_l = h_{l-1} + Attn(RMSNorm(h_{l-1}))
h_l = h̃_l + FFN(RMSNorm(h̃_l))

A TIDE layer inserts the memory signal inside the FFN's input (after RMSNorm, before the FFN):

α_l = softmax(W_r^l · ñ_l) ∈ ℝ^{K+1}
m_l(v) = ∑_{k=1}^{K+1} α_k^l M_k(v)
h_l = h̃_l + FFN(ñ_l + m_l(v))

Where ñ_l = RMSNorm(h̃_l) is the post-attention normalized hidden state, and v is the original token index at this position.

Critical detail: The memory vector m_l(v) is added to the FFN's input (after normalization), not to the residual stream after the FFN. This means the FFN sees the token-identity signal and can use it to produce different outputs for different tokens, even when their contextualized hidden states are identical. The Lipschitz bottleneck is bypassed because the FFN input now depends on the discrete token index, not just the continuous hidden state.
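The injection step can be sketched for a single position in plain Python. This is a toy under stated assumptions, not the paper's implementation: the memory rows are assumed to already have the same width as the hidden state (so no projection is shown), and `softmax` and `tide_ffn_input` are hypothetical helper names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tide_ffn_input(n_tilde, router_logits, memory_rows):
    """Return n_tilde + m_l(v) for one position.

    n_tilde: normalized post-attention hidden state (list of floats)
    router_logits: K + 1 logits; the last one belongs to the null bank
    memory_rows: K vectors M_k(v), assumed to have len(n_tilde) entries
    """
    alpha = softmax(router_logits)
    slots = memory_rows + [[0.0] * len(n_tilde)]   # null bank = zero vector
    m = [sum(a * slot[i] for a, slot in zip(alpha, slots))
         for i in range(len(n_tilde))]
    return [x + y for x, y in zip(n_tilde, m)]

# Two "collapsed" tokens: identical hidden state, different memory rows.
n_shared = [0.5, -0.2, 0.1]
logits = [0.0, 0.0, 0.0]                     # K = 2 blocks + null bank
mem_u = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mem_v = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
x_u = tide_ffn_input(n_shared, logits, mem_u)
x_v = tide_ffn_input(n_shared, logits, mem_v)
print(x_u)
print(x_v)   # differs from x_u although n_shared is identical
```

Pushing the last logit far above the others makes the function return (almost exactly) `n_tilde`, which is the standard-transformer behaviour.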

Data flow: a concrete example

Let's trace through a TIDE layer for the token "1847" (index v = 42091) with K = 8 memory blocks, model dimension d = 2048, memory dimension d_b = 256:

| Step | Operation | Shape |
|---|---|---|
| 1 | Post-attention hidden state h̃_l | [B, T, 2048] |
| 2 | Normalize: ñ_l = RMSNorm(h̃_l) | [B, T, 2048] |
| 3 | Router: α_l = softmax(W_r^l ñ_l) | W_r: [9, 2048] → α: [B, T, 9] |
| 4 | Memory lookup: M_k(v) = RMSNorm(E_k[v]) | Each: [B, T, 256], already computed |
| 5 | Weighted sum: m_l(v) = ∑ α_k M_k(v) | [B, T, 256] |
| 6 | FFN input: ñ_l + m_l(v) | [B, T, 2048] (after projection) |
| 7 | h_l = h̃_l + FFN(ñ_l + m_l(v)) | [B, T, 2048] |

The memory lookup (Step 4) was computed once at the start of the forward pass and is simply indexed — no matrix multiplication, no attention, no gradient through context. Just a table lookup by token index. This is what makes TIDE fundamentally different from retrieval-augmented approaches: the memory is not queried by content, it is indexed by identity.

TIDE Architecture — Interactive Data Flow

Click "Step Through" to watch data flow through a TIDE layer. Toggle memory off to see a standard transformer layer. Toggle back to see how memory injection changes the FFN input.

Where is the memory vector ml(v) injected in a TIDE layer?

Chapter 4: EmbeddingMemory

The EmbeddingMemory is the core of TIDE. It is deceptively simple: K separate embedding tables, each mapping every token index to a learned vector. But the design decisions behind it are subtle and important.

Structure of a MemoryBlock

Each MemoryBlock k maintains its own embedding table E_k ∈ ℝ^{|V| × d_b}. For a token with index v, the output is:

M_k(v) = RMSNorm(E_k[v]) ∈ ℝ^{d_b}

That's it. A table lookup followed by normalization. No matrix multiplications, no attention, no activation functions. Each MemoryBlock is just an embedding table with RMSNorm applied.
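A dependency-free sketch of the lookup follows. It assumes a gain-free RMSNorm (real implementations usually carry a learnable gain) and stores the table as a list of rows; the names `rms_norm` and `memory_block_lookup` are illustrative only.

```python
import math

def rms_norm(x, eps=1e-6):
    """Gain-free RMSNorm: divide by the root-mean-square of the entries."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def memory_block_lookup(table, token_id):
    """M_k(v) = RMSNorm(E_k[v]) for a table stored as a list of rows."""
    return rms_norm(table[token_id])

# Toy table: 3 tokens, d_b = 4 (values are arbitrary).
E_k = [
    [0.1, -0.3, 0.2, 0.4],
    [2.0, 0.0, -1.0, 1.0],
    [0.01, 0.02, -0.01, 0.0],
]
out = memory_block_lookup(E_k, 1)
l2 = math.sqrt(sum(v * v for v in out))
print(out)
print(l2)   # close to sqrt(d_b) = 2.0 for this row
```

Note that normalizing to unit RMS leaves the L2 norm near √d_b rather than 1, so rows with very different raw magnitudes end up on a comparable scale.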

Why K independent tables instead of one big one?

You might ask: why not just make one larger embedding table and have the router select parts of it? The answer is gradient flow. Each of the K tables provides an independent gradient pathway into the token's representation. When token v appears in a batch, all K embedding tables receive gradients simultaneously through K different computational paths.

A single table of K · d_b dimensions would give the token one gradient vector. K separate tables of d_b dimensions give it K gradient vectors. This is the mechanism behind K-fold gradient amplification, which we will formalize in Chapter 6.

Think of it this way: Imagine you're learning a new word. With one embedding table, you get one teacher giving you one perspective on what the word means. With K MemoryBlocks, you get K different teachers, each developing an independent understanding of the word. Even if each individual teacher is weaker (d_b < d), the ensemble is stronger because the gradient signal is K-fold amplified.

No parameter sharing

A critical design choice: the K MemoryBlocks share no parameters with each other or with the primary embedding table E. This is deliberate. The paper verifies empirically (Figure 9 in the paper) that after training, the cosine distance between the primary embedding E and each MemoryBlock M_k ranges from 0.65 to 0.99, so they are highly distinct. The blocks don't degenerate into copies of E; they learn complementary token-identity signals.

| Property | Primary embedding E | MemoryBlock M_k |
|---|---|---|
| Dimension | d (model hidden dim, e.g., 2048) | d_b (smaller, e.g., 256) |
| Used at | Layer 0 only | Every layer (via router) |
| Input to | Residual stream h^(0) | FFN input (additive) |
| Gradient source | Single path through residual stream | K independent paths through K blocks |
| Context-dependent? | No (pure lookup) | No (pure lookup, router decides weight) |

Computed once, shared everywhere

The memory tensor is computed once at the start of each forward pass:

M = Stack_k(M_k(x)) ∈ ℝ^{B × T × K × d_b}

This tensor is then indexed at every layer with no recomputation. Each layer only needs to compute the router weights α_l (a cheap linear projection + softmax) and take a weighted sum of the pre-computed memory vectors. The memory lookup itself is O(1) per token, the same cost as the original embedding lookup at Layer 0.

MemoryBlock Diversity: Cosine Distance Heatmap

This heatmap shows mean cosine distance between embedding spaces (primary E and 8 MemoryBlocks) from a trained TIDE-8E-1B model. High values (brighter) mean more distinct representations.

The heatmap reveals two important findings. First, every MemoryBlock is highly distant from the primary embedding E (top row / left column are all bright), confirming the blocks learn genuinely new information rather than copying E. Second, inter-block distances are somewhat lower (the interior is slightly dimmer), suggesting the blocks converge to overlapping but non-collapsed subspaces — they are diverse enough to be useful but similar enough that the router can smoothly interpolate between them.

```python
# Sketch of the EmbeddingMemory forward pass (PyTorch)
import torch
import torch.nn as nn

class MemoryBlock(nn.Module):
    """One token-indexed table: M_k(v) = RMSNorm(E_k[v])."""
    def __init__(self, vocab_size, d_b):
        super().__init__()  # required before registering submodules
        self.embed = nn.Embedding(vocab_size, d_b)
        self.norm = nn.RMSNorm(d_b)  # nn.RMSNorm needs PyTorch >= 2.4

    def forward(self, token_ids):  # [B, T]
        return self.norm(self.embed(token_ids))  # [B, T, d_b]

class EmbeddingMemory(nn.Module):
    """K independent MemoryBlocks with no shared parameters."""
    def __init__(self, K, vocab_size, d_b):
        super().__init__()
        self.blocks = nn.ModuleList([
            MemoryBlock(vocab_size, d_b) for _ in range(K)
        ])

    def forward(self, token_ids):  # [B, T]
        # Computed ONCE per forward pass, shared across all layers
        return torch.stack([
            block(token_ids) for block in self.blocks
        ], dim=2)  # [B, T, K, d_b]
```
Why does TIDE use K separate MemoryBlocks instead of one larger embedding table of dimension K · d_b?

Chapter 5: The Router

The EmbeddingMemory provides K different views of each token's identity. But which views should each layer use? A deep layer processing abstract semantics might need different token-identity signals than an early layer processing surface syntax. This is the job of the depth-conditioned router.

How the router works

At each layer l, the router takes the post-attention normalized hidden state and projects it to K+1 logits:

α_l = softmax(W_r^l · ñ_l) ∈ ℝ^{K+1}

Where W_r^l ∈ ℝ^{(K+1) × d} is a per-layer weight matrix. The softmax ensures α_k^l > 0 and ∑_k α_k^l = 1. This is about as cheap as a routing mechanism gets: a single matrix multiply followed by softmax. No MLP, no gating, no top-k selection.

The null bank: a learned "off switch"

The (K+1)-th slot is special. Its MemoryBlock output is always zero: M_{K+1}(v) = 0 for all tokens v. This is the null bank. It has no parameters; it is just the zero vector.

Why include it? Because the router uses softmax, it must assign all its probability mass somewhere. Without the null bank, the router is forced to always inject some memory signal, even when the contextual residual stream is sufficient. The null bank gives the router a way to say "no memory needed here" by assigning weight to the zero slot.

The null bank is why TIDE generalizes standard transformers. If the router assigns (in the limit) all weight to the null bank at every layer, m_l(v) = 0 everywhere, and TIDE degenerates to the standard transformer. The model can learn to turn off the memory pathway entirely. In terms of what it can represent, TIDE is therefore never worse than the baseline: the memory pathway only adds capacity.

What the router learns: empirical patterns

The paper analyzes the trained router weights in TIDE-8E-1B and finds striking patterns. For the last layer, stratified by token frequency:

| Token frequency | Null bank weight α_null | Active memory weight 1 - α_null |
|---|---|---|
| Rarest decile (0-10%) | 0.530 | 0.470 |
| 10-20% | 0.529 | 0.471 |
| 30-40% | 0.709 | 0.291 |
| 50-60% | 0.765 | 0.235 |
| 70-80% | 0.785 | 0.215 |
| Most common (90-100%) | 0.889 | 0.111 |

This is remarkable. The router has learned, without any explicit frequency signal, to open the gate wide for rare tokens (47% active memory) and nearly close it for common tokens (only 11% active memory). Apart from a negligible dip between the two rarest deciles (0.530 to 0.529), the null bank weight increases with token frequency, which is what the theory predicts. Common tokens have well-trained embeddings and don't need the memory pathway. Rare tokens desperately need it.
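The table's two columns and its trend can be checked mechanically. The snippet below is a small sanity check on the numbers quoted above, nothing more; the variable names are illustrative.

```python
# Last-layer null-bank weights by frequency decile, copied from the table above.
NULL_WEIGHT = {
    "0-10%": 0.530, "10-20%": 0.529, "30-40%": 0.709,
    "50-60%": 0.765, "70-80%": 0.785, "90-100%": 0.889,
}

# Active memory weight is the complement of the null-bank weight.
active = {decile: round(1.0 - w, 3) for decile, w in NULL_WEIGHT.items()}

# Collect every adjacent pair where the null weight goes down instead of up.
weights = list(NULL_WEIGHT.values())
dips = [(a, b) for a, b in zip(weights, weights[1:]) if b < a]

print(active)
print("decreasing steps:", dips)
```

The only decreasing step is the 0.530 to 0.529 move between the two rarest deciles; everywhere else the null weight rises with frequency.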

Block specialization

Even more interesting: among the active (non-null) memory blocks, the router weights are non-uniform. In TIDE-8E-1B, block M_5 carries an outsized share for rare tokens (α_5 ≈ 0.31 on the rarest decile) while being nearly zero for common tokens. Block M_2 specializes for mid-frequency tokens. The blocks don't redundantly co-fire; they specialize for different frequency regimes.

Router Weight Heatmap Across Layers and Frequency Bins

Heatmap shows router weight assigned to active memory blocks (left) and null bank (right) across token frequency bins. Rare tokens (top rows) receive more memory; common tokens (bottom) are mostly routed to null.

```python
# Router sketch for a single TIDE layer (PyTorch)
import torch
import torch.nn as nn
import torch.nn.functional as F

class TIDERouter(nn.Module):
    def __init__(self, d_model, K):
        super().__init__()  # required before registering submodules
        self.proj = nn.Linear(d_model, K + 1)  # K blocks + null bank

    def forward(self, normed_hidden, memory_tensor):
        # normed_hidden: [B, T, d_model]
        # memory_tensor: [B, T, K, d_b]  (precomputed)
        logits = self.proj(normed_hidden)  # [B, T, K+1]
        alpha = F.softmax(logits, dim=-1)  # [B, T, K+1]

        # Null bank: append a zero vector as the (K+1)-th slot
        B, T, K, d_b = memory_tensor.shape
        null = memory_tensor.new_zeros(B, T, 1, d_b)  # same device and dtype
        memory_with_null = torch.cat([memory_tensor, null], dim=2)
        # memory_with_null: [B, T, K+1, d_b]

        # Weighted sum across the K+1 slots
        alpha_expanded = alpha.unsqueeze(-1)  # [B, T, K+1, 1]
        m = (alpha_expanded * memory_with_null).sum(dim=2)  # [B, T, d_b]
        return m  # the caller projects this to d_model and adds it to the FFN input
```
In the trained TIDE-8E-1B model, the null bank weight for the rarest token decile is 0.530, and for the most common decile it is 0.889. What does this tell us?

Chapter 6: Theory

TIDE makes three precise theoretical claims. Let's prove each one, step by step, with no hand-waving.

Claim 1: Asymptotic generalization

TIDE can approximate a standard transformer to arbitrary precision. In other words, TIDE is at least as expressive as the baseline.

Proposition 3.1. For any ε > 0, there exist finite router parameters W_r^l such that ||m_l(v)|| < ε for all tokens v and all layers l.

The proof relies entirely on the null bank. Let's work through it.

Step 1: Router softmax decomposition
The router produces α = softmax(z), where z ∈ ℝ^{K+1}. Slot K+1 is the null bank with M_{K+1}(v) = 0.
Step 2: Dominate the null logit
Set z_{K+1} = s (a large scalar) while keeping all other logits fixed. Then α_{K+1} = e^s / (∑_{k=1}^{K} e^{z_k} + e^s).
Step 3: Active bank weights vanish
Taking the other logits to be z_k = 0 for simplicity, ∑_{k=1}^{K} α_k = K / (K + e^s) → 0 as s → ∞.
Step 4: Bound the memory norm
||m_l(v)|| = ||∑ α_k M_k(v)|| ≤ (1 - α_{K+1}) · C, where C = max_{v,k} ||M_k(v)|| < ∞ (RMSNorm bounds this).
Step 5: Solve for finite s*
Requiring K · C / (K + e^s) < ε gives s > s* = log(K(C - ε)/ε). Any null logit above this finite threshold achieves ||m_l(v)|| < ε, so the router parameters are finite.

Let's verify with numbers. Suppose K = 8, C = 1.0 (a unit bound chosen for illustration; a gain-free RMSNorm output actually has L2 norm √d_b, which raises s* only logarithmically), and we want ε = 0.001:

s* = log(8 · (1.0 - 0.001) / 0.001) = log(7992) ≈ 8.99

Setting the null bank logit to about 9.0 is enough to make the memory contribution negligible. This is a perfectly achievable parameter configuration.
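The threshold is easy to reproduce. This sketch assumes, as the proof's Step 3 does, that the K active logits are all zero; the function names are hypothetical.

```python
import math

def null_logit_threshold(K, C, eps):
    """s* = log(K * (C - eps) / eps), assuming the K active logits are all 0."""
    return math.log(K * (C - eps) / eps)

def memory_norm_bound(K, C, s):
    """Upper bound (1 - alpha_null) * C = K / (K + e^s) * C on ||m_l(v)||."""
    return K / (K + math.exp(s)) * C

K, C, eps = 8, 1.0, 1e-3
s_star = null_logit_threshold(K, C, eps)
print(f"s* = {s_star:.3f}")
print(f"bound at s*  : {memory_norm_bound(K, C, s_star):.6f}")  # equals eps
print(f"bound at 9.0 : {memory_norm_bound(K, C, 9.0):.6f}")     # below eps
```

At exactly s* the bound equals ε, which is why any logit strictly above the threshold (such as 9.0) is what delivers the strict inequality.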

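What this means practically: in terms of expressiveness, TIDE can never be worse than the standard transformer. During training, if the memory pathway is not helpful, the router can learn to route everything to the null bank. The optimizer has full freedom to ignore the memory entirely. In that sense the guarantee is a "free lunch": the baseline is always available as a special case, though the result speaks to what the model can represent rather than to what training will find.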

Claim 2: K-fold gradient amplification

This is the core quantitative result. TIDE amplifies the gradient signal for every token by a factor of K.

Proposition 3.2. Under minibatch SGD, the total expected cumulative squared gradient norm across all K embedding tables for token v satisfies:

E[ ∑_{s=1}^{τ} ∑_{k=1}^{K} ||∇_{e_v^(k)} L_s||^2 ] ≥ K · τ · κ_v · G_min^2

Where κ_v = 1 - (1 - f_v)^{BT} ≈ f_v · BT for small f_v.

Let's derive this carefully.

Step 1: Token appears in batch
At step s, token v appears in the batch with probability κ_v = 1 - (1 - f_v)^{BT}. When it does, ALL K embedding tables receive a non-zero gradient because the forward pass routes through all K blocks (softmax weights are strictly positive for finite logits).
Step 2: Per-block gradient lower bound
For each block k, conditioned on v appearing in the batch: ||∇_{e_v^(k)} L_s||^2 ≥ G_min^2 > 0. This holds because the router weight α_k > 0 (softmax with finite logits is always positive), so the gradient through block k is non-degenerate.
Step 3: Sum across blocks
Conditioned on v in batch: ∑_{k=1}^{K} ||∇_{e_v^(k)} L_s||^2 ≥ K · G_min^2. The blocks are independent (no parameter sharing), so each contributes independently.
Step 4: Take expectation over τ steps
E[ ∑_{s=1}^{τ} ∑_{k=1}^{K} ||∇_{e_v^(k)} L_s||^2 ] ≥ ∑_{s=1}^{τ} κ_v · K · G_min^2 = K · τ · κ_v · G_min^2.

Compare this to the standard transformer's bound: τ · f_v · BT · G^2 (a single gradient pathway). TIDE provides K times more gradient signal through K independent pathways.

Numerical example: For a hapax token (f_v = 8.3 × 10^-9) with K = 8 MemoryBlocks, TIDE provides 8× the gradient signal of the baseline. The token still only appears in ~1,660 batches over 200B tokens, but each appearance pushes 8 independent embedding tables simultaneously, each developing its own representation of this token.
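The quantities in Proposition 3.2 can be evaluated directly. This is a sketch under assumptions: G_min^2 is not given in the text, so it is set to a placeholder value of 1.0, which means only ratios between the printed bounds are meaningful. The function names are illustrative.

```python
import math

def batch_hit_probability(f_v, B, T):
    """kappa_v = 1 - (1 - f_v)^(B*T), computed stably for tiny f_v."""
    return -math.expm1(B * T * math.log1p(-f_v))

def tide_gradient_lower_bound(K, tau, f_v, B, T, g_min_sq=1.0):
    """K * tau * kappa_v * G_min^2; g_min_sq = 1.0 is a placeholder."""
    return K * tau * batch_hit_probability(f_v, B, T) * g_min_sq

B, T, f_v = 8, 2048, 8.3e-9
tau = 200e9 / (B * T)                  # optimizer steps for 200B tokens
kappa = batch_hit_probability(f_v, B, T)

print(f"kappa_v = {kappa:.6e}, f_v * B * T = {f_v * B * T:.6e}")
print(f"expected batches containing the token: {tau * kappa:.1f}")
for K in (1, 8, 24):
    print(K, tide_gradient_lower_bound(K, tau, f_v, B, T))
```

For a hapax token the linear approximation κ_v ≈ f_v · BT is accurate to well under 0.1%, and the lower bound scales exactly linearly in K.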

Claim 3: Lipschitz bypass

Proposition 3.3. For a collapsed token pair (u, v) with ||h_u - h_v|| ≤ δ, the EmbeddingMemory can achieve any target separation C > 0 regardless of δ and L_FFN.

This follows from a simple but powerful observation. The memory output M_k(v) = RMSNorm(E_k[v]) depends on the discrete token index v, not on the hidden state h. The hidden state collapses? The memory doesn't care: it looks up v directly.

||M_k(u) - M_k(v)|| = C   regardless of   ||h_u - h_v|| = δ

The embedding rows E_k[u] and E_k[v] are separate, uncoupled parameters. They can be set to any values independently, so the RMSNorm outputs can achieve any prescribed separation. This is fundamentally different from the FFN, which must map from continuous hidden states and is therefore subject to the Lipschitz bound.

The discrete bypass: The FFN operates on continuous hidden states — close inputs force close outputs (Lipschitz). The EmbeddingMemory operates on discrete token indices — two different tokens always have independent embeddings. This is why TIDE routes around the FFN bottleneck rather than fighting it.
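A tiny lookup illustrates the point. The memory rows and hidden states below are made-up values for two hypothetical tokens; the only claim is structural: the memory distance is computed from token identities and never touches δ.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical memory rows for two distinct tokens (values are arbitrary).
MEMORY = {
    "1847": [1.0, -1.0, 1.0, -1.0],
    "1849": [-1.0, 1.0, 1.0, 1.0],
}

def memory_separation(u, v):
    """Distance between lookups; it depends only on the token identities."""
    return l2(MEMORY[u], MEMORY[v])

for delta in (1e-1, 1e-3, 1e-6):
    h_u = [0.3, 0.7]
    h_v = [0.3, 0.7 + delta]   # hidden states collapse as delta shrinks
    print(f"delta={delta:g}  hidden gap={l2(h_u, h_v):.2e}  "
          f"memory gap={memory_separation('1847', '1849'):.3f}")
```

The hidden-state gap shrinks with δ while the memory gap stays fixed, which is the separation the FFN can then act on.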
Gradient Amplification: Standard vs TIDE

Compare cumulative gradient signal for a rare token (f_v = 10^-8) in a standard transformer (1 pathway) vs TIDE with K memory blocks. Drag the K slider to see the amplification.

Why can the EmbeddingMemory separate collapsed token pairs that the FFN cannot?

Chapter 7: Results

Theory gives us guarantees. Now let's see what TIDE actually delivers on real benchmarks at real scale.

The rare token payoff

The paper's most striking result is Figure 5: mean validation cross-entropy loss per frequency decile for LLaMA-Base-1B vs TIDE-8E-1B, both trained on 200B tokens.

TIDE improves on every single decile, but the gains are sharply asymmetric:

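| Frequency decile | Loss reduction (nats) | Relative improvement |
|---|---|---|
| 0-10% (rarest) | 0.704 | 9.0% |
| 10-20% | 0.507 | 6.5% |
| 20-30% | 0.301 | 5.2% |
| 30-40% | 0.194 | 4.2% |
| 40-50% | 0.138 | 3.0% |
| 50-60% | 0.135 | 3.1% |
| 60-70% | 0.125 | 3.1% |
| 70-80% | 0.122 | 3.2% |
| 80-90% | 0.118 | 2.6% |
| 90-100% (most common) | 0.068 | 2.4% |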

The rare-to-common improvement ratio is 0.704/0.068 = 10.4×. TIDE helps rare tokens roughly ten times more than common tokens. This is exactly the signature of K-fold gradient amplification — the tokens that were most gradient-starved benefit the most from the additional gradient pathways.

Mean gain across bins: Rare tokens (bins 0-2) see a mean improvement of 0.504 nats. Common tokens (bins 7-9) see 0.104 nats. That is a 4.8× disparity in absolute gain — confirming the prediction from Proposition 3.2.
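These summary figures can be recomputed from the per-decile table. The snippet only redoes the arithmetic on the rounded values quoted above; recomputed this way, the common-token mean comes out near 0.103, marginally below the quoted 0.104, presumably because the source figures were not rounded first.

```python
# Loss reduction (nats) per frequency decile, copied from the table above.
LOSS_REDUCTION = [0.704, 0.507, 0.301, 0.194, 0.138,
                  0.135, 0.125, 0.122, 0.118, 0.068]

def mean(values):
    return sum(values) / len(values)

rare_to_common = LOSS_REDUCTION[0] / LOSS_REDUCTION[-1]   # rarest vs most common
rare_mean = mean(LOSS_REDUCTION[0:3])                     # bins 0-2
common_mean = mean(LOSS_REDUCTION[7:10])                  # bins 7-9

print(f"rarest / most common: {rare_to_common:.1f}x")
print(f"mean gain, bins 0-2 : {rare_mean:.3f} nats")
print(f"mean gain, bins 7-9 : {common_mean:.3f} nats")
print(f"ratio of means      : {rare_mean / common_mean:.1f}x")
```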

Perplexity improvements

On standard language modeling benchmarks with the 1B-scale model family:

| Model | WikiText-2 PPL ↓ | PubMed PPL ↓ | DCLM PPL ↓ |
|---|---|---|---|
| LLaMA-Base-1B | ~13.0 | ~15.0 | ~21.0 |
| TIDE-2E-1B | ~12.5 | ~14.3 | ~20.0 |
| TIDE-8E-1B | ~11.8 | ~13.5 | ~18.5 |
| TIDE-16E-1B | ~11.5 | ~13.2 | ~18.0 |
| TIDE-24E-1B | ~11.2 | ~12.8 | ~17.5 |

Improvement is monotonic in K — more MemoryBlocks always helps, and there is no saturation even at K = 24. The gains are substantial: TIDE-24E-1B reduces WikiText-2 perplexity by ~14% relative to the baseline.
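The relative reductions follow directly from the table. Since every perplexity there is marked approximate, the percentages below are approximate too; the dictionary layout and function name are just one convenient way to organize the numbers.

```python
# Approximate perplexities copied from the table above (all values are "~").
PPL = {
    "LLaMA-Base-1B": {"WikiText-2": 13.0, "PubMed": 15.0, "DCLM": 21.0},
    "TIDE-24E-1B":   {"WikiText-2": 11.2, "PubMed": 12.8, "DCLM": 17.5},
}

def relative_reduction(baseline, improved):
    """Fractional drop in perplexity relative to the baseline."""
    return (baseline - improved) / baseline

reductions = {
    name: relative_reduction(PPL["LLaMA-Base-1B"][name], PPL["TIDE-24E-1B"][name])
    for name in PPL["LLaMA-Base-1B"]
}
for name, r in reductions.items():
    print(f"{name}: {100 * r:.1f}% lower perplexity")
```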

Faster convergence

A remarkable finding: TIDE with just 2-4 MemoryBlocks at 100B training tokens matches the perplexity that the baseline reaches at 200B tokens. The additional gradient pathways translate directly to faster effective convergence — TIDE learns more from each training step.

Downstream benchmarks

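Zero-shot accuracy across eight benchmarks at the 1B scale (six are shown here; the Average column is as reported):

| Model | ARC-C | ARC-E | BoolQ | HellaSwag | LAMBADA | PIQA | Average |
|---|---|---|---|---|---|---|---|
| LLaMA-Base-1B | 37.5 | 64.4 | 61.7 | 63.9 | 64.6 | 74.9 | 61.4 |
| TIDE-8E-1B | 37.5 | 64.5 | 69.3 | 65.3 | 64.7 | 75.5 | 63.0 |
| TIDE-24E-1B | 38.9 | 66.3 | 69.5 | 66.3 | 66.4 | 77.3 | 63.7 |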

The average improves from 61.4 to 63.7 (+2.3 points absolute). Notable individual gains include BoolQ (+7.8 points), HellaSwag (+2.4), and PIQA (+2.4). The improvements scale consistently from 750M to 3B parameters, confirming TIDE is not a small-model trick.
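The per-benchmark gains quoted here can be read off the table with a short script. It covers only the six benchmarks listed in the table, so it does not attempt to reproduce the reported average.

```python
# Zero-shot accuracies copied from the table above (six listed benchmarks).
BENCHMARKS = ["ARC-C", "ARC-E", "BoolQ", "HellaSwag", "LAMBADA", "PIQA"]
BASE = [37.5, 64.4, 61.7, 63.9, 64.6, 74.9]        # LLaMA-Base-1B
TIDE_24E = [38.9, 66.3, 69.5, 66.3, 66.4, 77.3]    # TIDE-24E-1B

# Absolute gain in accuracy points, rounded to the table's precision.
gains = {name: round(t - b, 1) for name, b, t in zip(BENCHMARKS, BASE, TIDE_24E)}
largest = max(gains, key=gains.get)

for name, g in gains.items():
    print(f"{name:>9}: +{g:.1f}")
print("largest gain:", largest)
```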

Contextual collapse resolution

The paper revisits the three collapse categories from Chapter 2 and compares L2 separation between base and TIDE models. Across all categories, TIDE increases layer-wise separation, with the largest gains in the middle-to-terminal layers (where collapse is most severe). Numeric tokens — the worst collapse category — are the biggest beneficiary, seeing +26.1 mean L2 improvement.

[Interactive figure: Results Dashboard, LLaMA-Base vs TIDE across token deciles. Bar chart of cross-entropy loss per frequency decile, shown either as absolute loss or as improvement delta.]

Scaling behavior: K from 0 to 24

The paper decomposes held-out cross-entropy by token frequency as K increases from 0 (baseline) to 24. The rare-token loss slope is 3.7× steeper than the common-token slope — each additional MemoryBlock benefits rare tokens almost four times more than common tokens. Even K = 2 captures ~55% of the total rare-token improvement at K = 24, suggesting the bulk of the benefit is achievable with modest overhead.

TIDE-8E-1B improves the rarest decile's loss by 0.704 nats (9.0%) and the most common decile by 0.068 nats (2.4%). What is the rare-to-common gain ratio?

Chapter 8: Memory Efficiency

TIDE adds K embedding tables, each of size |V| × d_b. With |V| = 128,256 tokens and d_b = 256, each MemoryBlock is 128,256 × 256 ≈ 32.8 million parameters. At K = 8, that's about 263M extra parameters. At K = 24, it's about 788M. How do we keep this manageable?

The key insight: memory tables are static

During inference, the EmbeddingMemory tables are read-only lookup tables. They are indexed by token identity — a discrete integer — and never modified. This means they have the same properties as the primary embedding table E: they are static, their access pattern is known in advance (determined by the input tokens), and they can be heavily compressed.

Think of it this way: The EmbeddingMemory tables are like dictionaries. During inference, you look up K entries per token per forward pass. This is fundamentally different from the FFN, whose activations depend on the continuous hidden state and must be computed in real-time with full-precision matrix multiplications.

4-bit quantization

The paper demonstrates that the EmbeddingMemory tables can be quantized to 4-bit precision with negligible performance impact. Why does this work so well?

Each MemoryBlock output passes through RMSNorm, which rescales the vector to roughly unit root-mean-square magnitude (up to the learned gain). The downstream computation is a weighted sum followed by addition to the FFN input. The model is robust to small perturbations in this additive signal: the FFN was trained to work with approximate memory vectors, not exact ones.

At 4-bit precision:

Memory per block = |V| × d_b × 4 bits = 128,256 × 256 × 0.5 bytes ≈ 16.4 MB

For K = 8: 131 MB total. For K = 24: 394 MB total. These are small numbers.
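To make the robustness argument concrete, here is a minimal sketch of 4-bit quantization on a single table row. It assumes symmetric per-row absmax quantization, which is one common scheme and not necessarily the one the paper uses; the row values are random stand-ins, and all function names are illustrative.

```python
import math
import random


def quantize_row_int4(row):
    """Symmetric per-row absmax quantization to integer codes in [-7, 7]."""
    scale = max(abs(x) for x in row) / 7 or 1.0  # fall back to 1.0 for an all-zero row
    codes = [max(-7, min(7, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    return [c * scale for c in codes]


def rms_norm(row, eps=1e-6):
    """RMSNorm without a learned gain: rescale to unit root-mean-square."""
    rms = math.sqrt(sum(x * x for x in row) / len(row) + eps)
    return [x / rms for x in row]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


random.seed(0)
d_b = 256
row = [random.gauss(0.0, 0.02) for _ in range(d_b)]  # hypothetical table row

codes, scale = quantize_row_int4(row)
restored = dequantize_row(codes, scale)
similarity = cosine(rms_norm(row), rms_norm(restored))

# 4 bits per weight; per-row scales are not counted in this estimate.
bytes_per_block = 128_256 * d_b * 4 // 8
print(f"cosine similarity after RMSNorm: {similarity:.4f}")
print(f"bytes per block: {bytes_per_block:,}")
```

Because RMSNorm removes the overall scale, only the direction of the restored row matters downstream, which is why cosine similarity is the quantity measured here.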

SSD offloading

The memory tables can be offloaded to SSD (solid-state drive) rather than kept in VRAM. Since the access pattern is determined by the input token sequence (known before the forward pass begins), the tables can be prefetched asynchronously. The latency of SSD access (~100-500 μs) is hidden behind the GPU computation of attention and FFN.
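The prefetch idea can be illustrated with plain file I/O. The sketch below is a toy under stated assumptions: a tiny float32 table written to a temporary file stands in for an SSD-resident MemoryBlock, a background thread stands in for asynchronous prefetch, and `attention_stand_in` is a placeholder for the GPU work that would overlap with the read. None of these names come from the paper, and a real system would use quantized rows and a proper asynchronous I/O path.

```python
import array
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

VOCAB, D_B = 1000, 8   # toy sizes, not the paper's
ROW_BYTES = D_B * 4    # float32 rows


def write_table(path):
    """Write a toy table in which row t is filled with the value float(t)."""
    with open(path, "wb") as f:
        for t in range(VOCAB):
            array.array("f", [float(t)] * D_B).tofile(f)


def fetch_rows(path, token_ids):
    """Read only the rows this sequence needs: one seek per unique token."""
    rows = {}
    with open(path, "rb") as f:
        for t in sorted(set(token_ids)):
            f.seek(t * ROW_BYTES)
            row = array.array("f")
            row.frombytes(f.read(ROW_BYTES))
            rows[t] = row.tolist()
    return [rows[t] for t in token_ids]


def attention_stand_in(token_ids):
    """Placeholder for compute that runs while the disk read is in flight."""
    return sum(token_ids)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "memory_block_0.bin")
    write_table(path)
    token_ids = [452, 892, 110, 783, 120, 912]  # known before the forward pass
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_rows, path, token_ids)  # start the prefetch
        _ = attention_stand_in(token_ids)                   # overlapping work
        memory = pending.result()                           # rows ready when needed

print(len(memory), "rows fetched; first row starts with", memory[0][0])
```

The key property is that `token_ids` alone determines which rows are read, so the fetch can be issued before any layer has run.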

VRAM budget breakdown

The paper provides a detailed breakdown for each TIDE variant:

| Model | VRAM Params (8-bit) | SSD Params (4-bit) | Total Params |
| --- | --- | --- | --- |
| LLaMA-Base-1B | 1.028 GB | 0 GB | 1.03B |
| TIDE-2E-1B | 1.028 GB | 0.263 GB | 1.05B |
| TIDE-4E-1B | 1.028 GB | 0.525 GB | 1.05B |
| TIDE-8E-1B | 1.028 GB | 1.051 GB | 1.03B + 0.53B mem |
| TIDE-16E-1B | 1.028 GB | 2.101 GB | 1.03B + 1.05B mem |
| TIDE-24E-1B | 1.028 GB | 3.152 GB | 1.03B + 1.58B mem |

The critical column is VRAM: it stays at 1.028 GB for every TIDE variant, identical to the baseline. The only cost is SSD storage, which is cheap. The VRAM footprint — the actual constraint for GPU deployment — is unchanged.

The free lunch: TIDE-8E-1B delivers 9% improvement on rare tokens and 2.4% on common tokens while using the exact same VRAM as LLaMA-Base-1B. The only overhead is 1.05 GB of SSD storage for the 4-bit quantized MemoryBlocks. For K = 2 (which captures ~55% of total rare-token improvement), the SSD cost is just 263 MB.
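The budget arithmetic in this chapter can be sketched as a small calculator. The function name and its parameters are illustrative. Note one thing the arithmetic makes visible: the worked figures earlier in the chapter use d_b = 256, while the SSD column of the table matches what the same formula gives for d_b = 2048 at 4 bits. That is an observation about the numbers only; which block width the paper's configurations actually use is not settled by this section.

```python
VOCAB_SIZE = 128_256


def memory_budget(k, d_b, bits=4, vocab_size=VOCAB_SIZE):
    """Return (extra parameters, storage bytes) for k MemoryBlocks."""
    params = k * vocab_size * d_b
    storage_bytes = params * bits // 8
    return params, storage_bytes


# The chapter's worked example: d_b = 256.
for k in (2, 8, 24):
    params, nbytes = memory_budget(k, d_b=256)
    print(f"K={k:2d}  d_b=256   params={params / 1e6:6.1f}M  4-bit={nbytes / 1e6:6.1f} MB")

# Same formula with d_b = 2048, for comparison with the table's SSD column.
for k in (2, 4, 8, 16, 24):
    _, nbytes = memory_budget(k, d_b=2048)
    print(f"K={k:2d}  d_b=2048  4-bit={nbytes / 1e9:.3f} GB")
```

In both cases the storage grows linearly with K, while the VRAM-resident parameters are untouched.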

[Interactive: Memory Budget Calculator. Adjust K (number of MemoryBlocks) and vocabulary size to see the VRAM vs SSD breakdown. VRAM stays constant while SSD scales linearly with K.]

Inference overhead

Beyond memory, what about compute? Each TIDE layer adds:

| Operation | Cost | Relative to FFN |
| --- | --- | --- |
| Router projection | (K+1) × d multiply | Negligible (~0.1%) |
| Softmax over K+1 | O(K) per token | Negligible |
| Weighted sum of K vectors | K × d_b multiply-add | Negligible (~0.5%) |
| Memory lookup | K table lookups (prefetched) | Hidden behind GPU compute |

The total additional FLOPs per layer are dominated by the router projection: (K+1) × d multiplications. For K = 8, d = 2048, that's 9 × 2048 = 18,432 FLOPs. Compare to the FFN's ~2 × d × 4d = 2 × 2048 × 8192 = 33.6M FLOPs. The router is less than 0.06% of the FFN cost.
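The comparison above is easy to reproduce for other values of K. The sketch below uses the same rough counting as the text (one multiplication per weight for the router, and 2 × d × 4d for the FFN); the function names are illustrative.

```python
def router_flops(k, d_model):
    """Router projection: one d-dimensional dot product per logit (K blocks + null)."""
    return (k + 1) * d_model


def ffn_flops(d_model, expansion=4):
    """The text's rough FFN estimate: 2 * d * (expansion * d)."""
    return 2 * d_model * expansion * d_model


d_model = 2048
for k in (2, 8, 24):
    share = router_flops(k, d_model) / ffn_flops(d_model)
    print(f"K={k:2d}: router={router_flops(k, d_model):,}  share of FFN={100 * share:.3f}%")
```

Even at K = 24 the router stays a small fraction of a percent of the FFN under this estimate.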

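```python
# TIDE layer sketch (needs PyTorch >= 2.4 for nn.RMSNorm)
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiLUGatedFFN(nn.Module):
    """SiLU-gated FFN; the 4x hidden width is an assumption."""
    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(d_model, 4 * d_model, bias=False)
        self.up = nn.Linear(d_model, 4 * d_model, bias=False)
        self.down = nn.Linear(4 * d_model, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class TIDELayer(nn.Module):
    def __init__(self, d_model, d_b, K, n_heads):
        super().__init__()  # required before registering submodules
        # Stand-in for the model's own attention (no RoPE in this sketch)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = SiLUGatedFFN(d_model)
        self.norm1 = nn.RMSNorm(d_model)
        self.norm2 = nn.RMSNorm(d_model)
        self.router = nn.Linear(d_model, K + 1)  # tiny: K blocks + null bank
        self.mem_proj = nn.Linear(d_b, d_model)  # needed when d_b != d_model

    def forward(self, h, memory):
        # h: [B, T, d_model]; memory: [B, T, K, d_b], precomputed per token
        B, T, K, d_b = memory.shape

        # Attention block (unchanged), with a causal mask (True = masked out)
        x = self.norm1(h)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), 1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        h_tilde = h + attn_out

        # Router: which memory blocks to use at this depth?
        n_tilde = self.norm2(h_tilde)                    # [B, T, d_model]
        alpha = F.softmax(self.router(n_tilde), dim=-1)  # [B, T, K+1]

        # Memory injection; the null bank is an all-zero slot at index K
        null = memory.new_zeros(B, T, 1, d_b)            # same dtype and device
        mem_null = torch.cat([memory, null], dim=2)      # [B, T, K+1, d_b]
        m = (alpha.unsqueeze(-1) * mem_null).sum(dim=2)  # [B, T, d_b]
        m = self.mem_proj(m)                             # [B, T, d_model]

        # FFN with memory-augmented input
        return h_tilde + self.ffn(n_tilde + m)
```
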
TIDE-8E-1B adds ~530M parameters via MemoryBlocks. How much additional VRAM does it require compared to LLaMA-Base-1B at inference time?

Chapter 9: Connections

TIDE doesn't exist in isolation. It connects to several important threads in the LLM landscape. Understanding these connections helps you see where the field is heading.

RoPE: position identity at every layer

The closest analog to TIDE is Rotary Position Embeddings (RoPE). RoPE re-injects position information at every attention layer by rotating the query and key vectors. Without RoPE, the model would need to infer position from the residual stream alone — and it would suffer the same kind of "positional collapse" that TIDE fixes for token identity.

The parallel is close, differing mainly in where and how the signal is injected:

| Property | RoPE (Position) | TIDE (Token Identity) |
| --- | --- | --- |
| What is injected | Position index → rotation matrix | Token index → memory vector |
| Where | Every attention layer (Q, K) | Every FFN input |
| How | Multiplicative rotation | Additive fusion |
| Without it | Position lost in deep layers | Token identity lost in deep layers |
| Discrete input? | Yes (position integer) | Yes (token index integer) |

TIDE's thesis in one sentence: Token identity deserves the same persistent, per-layer re-injection that position already gets via RoPE. The standard transformer gives position persistent identity but gives token identity a single shot.
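For readers who have not seen RoPE written out, here is a minimal sketch of the rotation it applies to queries and keys. It uses the interleaved-pair convention and the usual base of 10000; real implementations differ in how they pair dimensions, so treat the pairing as an assumption. The point it demonstrates is that the attention score depends only on the relative offset between two positions, and that this signal is recomputed from the discrete position index at every layer.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
# A different relative offset (1) gives a different score.
score_other = dot(rope(q, 5), rope(k, 4))

print(f"offset 3 near: {score_near:.6f}")
print(f"offset 3 far:  {score_far:.6f}")
print(f"offset 1:      {score_other:.6f}")
```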

Knowledge neurons and FFN-as-memory

A series of papers (Geva et al. 2021, 2022; Dai et al. 2022; Meng et al. 2022) established that FFN layers in transformers function as key-value memories. The first linear projection in each FFN acts as a pattern detector (keys) and the second projects specific information into the residual stream (values). Specific neurons were identified as "knowledge neurons" storing individual facts.

TIDE addresses a limitation of this paradigm: FFN-as-memory is indexed by the continuous hidden state, which means it inherits the Lipschitz bottleneck. When two tokens produce similar hidden states, the FFN retrieves similar "memories" for both, even if they need different factual information. TIDE's EmbeddingMemory is indexed by discrete token identity, bypassing this entirely.
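The bottleneck can be seen in a toy example. The sketch below builds a small random two-layer ReLU FFN (a stand-in, not a trained model) and feeds it two nearly identical hidden states, representing two different tokens whose contextual representations have collapsed. The FFN outputs can differ by at most a Lipschitz constant times the input gap, whereas rows of a token-indexed table are unconstrained. The token ids, sizes, and weights are all hypothetical.

```python
import math
import random

random.seed(0)
D, H = 8, 16


def rand_matrix(rows, cols, scale):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def frobenius(m):
    return math.sqrt(sum(w * w for row in m for w in row))


W_in, W_out = rand_matrix(H, D, 0.3), rand_matrix(D, H, 0.3)


def ffn(h):
    """Keys (W_in) detect patterns in h; values (W_out) write the result out."""
    return matvec(W_out, [max(0.0, a) for a in matvec(W_in, h)])


# Two different tokens whose hidden states have nearly collapsed.
h_a = [random.gauss(0.0, 1.0) for _ in range(D)]
h_b = [x + 1e-3 * random.gauss(0.0, 1.0) for x in h_a]

# Identity-indexed memory: rows are addressed by token id, not by hidden state.
memory = {1847: rand_matrix(1, D, 1.0)[0], 1848: rand_matrix(1, D, 1.0)[0]}

input_gap = dist(h_a, h_b)
ffn_gap = dist(ffn(h_a), ffn(h_b))
memory_gap = dist(memory[1847], memory[1848])
# ReLU is 1-Lipschitz, so the product of Frobenius norms bounds the FFN's constant.
lipschitz_bound = frobenius(W_out) * frobenius(W_in)

print(f"input gap:  {input_gap:.5f}")
print(f"FFN gap:    {ffn_gap:.5f} (bound {lipschitz_bound * input_gap:.5f})")
print(f"memory gap: {memory_gap:.5f}")
```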

Memory-augmented architectures

The lineage traces back to Memory Networks (Weston et al. 2014), End-to-End Memory Networks (Sukhbaatar et al. 2015), and Neural Turing Machines (Graves et al. 2014, 2016). These augment neural networks with external read-write memory banks. Product-key networks (Lample et al. 2019) improved scaling with efficient memory retrieval.

TIDE differs from all of these in a crucial way: its memory is not queried by content but by identity. Traditional memory networks use an attention-like mechanism to match queries against memory keys. TIDE simply looks up the token index — no matching, no scoring, no key-value attention. This makes TIDE O(1) per lookup rather than O(memory_size).
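The difference between the two access patterns fits in a few lines. The sketch below contrasts a soft, content-addressed read, written in the general style of attention-based memory and not taken from any one of the cited papers, with a plain index lookup. The keys, values, and function names are illustrative.

```python
import math


def content_read(query, keys, values):
    """Soft attention read: score every key, then mix the values. O(N * d)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    d = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(d)]


def identity_read(token_id, table):
    """Token-indexed read: a single row fetch, independent of table size."""
    return table[token_id]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

mixed = content_read([4.0, 0.0], keys, values)  # a blend of several slots
exact = identity_read(1, values)                # exactly row 1

print("content-addressed read:", [round(x, 3) for x in mixed])
print("identity-indexed read: ", exact)
```

The content-addressed read touches every slot and returns a blend that depends on the query vector; the identity read returns one row unchanged, whatever the hidden state looks like.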

Retrieval-augmented generation (RAG)

RAG systems (Lewis et al. 2020; Borgeaud et al. 2022; Izacard et al. 2023) augment LLMs with external knowledge retrieved at inference time. TIDE is fundamentally different: its memory is internal (part of the model parameters), trained end-to-end (not a separate retrieval system), and indexed by token identity (not by semantic similarity).

However, TIDE and RAG are complementary. RAG provides external, updateable knowledge. TIDE provides persistent token-level identity. A TIDE model with RAG would have both a strong internal memory (MemoryBlocks) and access to external corpora.

Sparse autoencoders and interpretability

Sparse autoencoders (SAEs) decompose transformer activations into interpretable features. TIDE's MemoryBlocks may be more directly interpretable than FFN activations because each block provides a clean, token-indexed vector that can be examined in isolation. The paper's analysis of cosine distances between blocks suggests each block encodes a distinct "aspect" of token identity — an intriguing direction for mechanistic interpretability.

MoLE, MemoryLLM, and STEM

TIDE builds on several concurrent works. MoLE (Jie et al. 2025) showed that in mixture-of-experts models, most experts can be trained directly with token-level embeddings. MemoryLLM (Jaiswal et al. 2026) completely decouples FFNs from the contextual residual stream by training layer-local, token-indexed embedding tables for interpretability. STEM (Sadhukhan et al. 2026) partially replaces FFN up-projections with embedding table lookups. TIDE unifies and extends these ideas with its global EmbeddingMemory shared across all layers, depth-conditioned routing, and null bank.

Limitations and open questions

| Limitation | Potential Direction |
| --- | --- |
| Memory tables grow linearly with vocab size | Structured embeddings (product quantization, LSH) |
| Tested up to 3B scale | Larger-scale experiments (7B, 70B) may reveal saturation |
| Fixed K across all layers | Adaptive K that varies by depth |
| Static memory at inference | Online memory updates for continual learning |
| Trains from scratch only | Retrofit into existing pretrained models |

The big picture: TIDE is part of a broader trend in transformer architecture design: giving the model explicit, persistent access to information that was previously only available implicitly through the residual stream. RoPE did this for position. TIDE does it for token identity. The next question is: what other information is the model losing track of in deep layers?
What is the closest existing mechanism to TIDE in standard transformer architectures?
"What I cannot create, I do not understand." — Richard Feynman