Standard transformers look up the token embedding once and throw it away. Every subsequent layer flies blind — it processes context but never re-consults which token it is working on. TIDE fixes this by giving every layer persistent, context-free access to token identity.
Picture yourself in a conversation. Someone says "I saw a crane by the river." You hear the word "crane" once, form a mental image — maybe a bird, maybe construction equipment — and then the word dissolves. For the rest of the conversation, you reason about this crane entirely through context: "it flew away," "it lifted the beam." But you never go back and re-read the original word.
This is exactly how a standard transformer processes language. It looks up each token in an embedding table exactly once — at Layer 0 — and then permanently discards the token index. Every subsequent layer operates on a contextualized hidden state: a blend of information from all surrounding tokens, mixed through attention and feed-forward networks. The original token identity? Gone.
For most tokens, this works fine. "The" and "cat" appear so often that their embeddings are well-trained, and context is usually sufficient to distinguish them. But what about the word "acetaminophen"? Or the number "1847"? Or the rare proper noun "Kvothe"?
Let's trace what happens inside a standard transformer. The input sequence "Jack loves his coffee with sugar" enters the model as token indices: [4521, 8923, 1104, 7834, 1205, 9122]. Each index is looked up in an embedding table E to produce a vector:
This is the only moment the model has direct access to which token it is processing. After this, Layer 1 computes attention over all positions, mixes information, and produces a new hidden state. Layer 2 does the same. By Layer 16 of a deep transformer, the hidden state at position 4 (originally "coffee") is a complex mixture of information from every token in the sequence.
The token index 7834 ("coffee") was consulted once and never again. Compare this to position, which is re-injected at every single attention layer via RoPE (Rotary Position Embeddings). Position gets persistent identity. Token identity gets a single shot.
A simulation of a simplified transformer accompanies this section. It tracks the "token identity signal" (the model's ability to distinguish which token is at each position purely from the hidden state) and shows it fading with depth. By the middle layers, tokens with similar contexts become nearly indistinguishable. In the visualization, each column is a token, and color opacity represents how much of the original token's unique identity signal the hidden state retains.
This fading is not a bug — it is a feature for common tokens. Contextual mixing is what makes transformers powerful. But for rare tokens whose embeddings are poorly trained, and for distinct tokens that happen to share syntactic context, the loss of token identity is catastrophic.
The 2026 paper by Jaiswal et al. at Apple identifies two specific failure modes caused by this single-injection design, proposes an elegant architectural fix called TIDE (Token Identity Delivered Everywhere), and proves theoretically why it works. Let's build it from scratch.
Open any book, any Wikipedia article, any codebase. Count how often each word appears. You will discover one of the most universal laws in language: Zipf's Law. The most frequent word ("the") appears roughly twice as often as the second most frequent ("of"), three times as often as the third, and so on. A tiny fraction of the vocabulary dominates the corpus.
For a modern LLM tokenizer like LLaMA-3 with |V| = 128,256 tokens, this means the top 1% of tokens account for roughly 80% of all occurrences. The bottom 10%? They might appear once every few billion tokens.
Here is the problem. Under minibatch SGD, a token embedding e_v only receives a gradient update when token v appears in the current batch. If token v is rare (say it shows up once every million batches), its embedding gets updated once every million steps while "the" gets updated every single step.
Let's make this precise. With batch size B, sequence length T, and per-token squared gradient norm bounded by G², the expected cumulative gradient signal after τ training steps is bounded by:
$$\mathbb{E}\left[\sum_{t=1}^{\tau} \lVert \nabla_{e_v} \mathcal{L}_t \rVert^2\right] \le \tau \cdot f_v \cdot BT \cdot G^2$$
where f_v is the unigram probability of token v, that is, how often it appears in the corpus. This is the key: gradient signal scales linearly with frequency.
Let's plug in real numbers from the paper. Training LLaMA-1B on 200 billion tokens with B = 8, T = 2048:
| Token Tier | Frequency Bin | f_v | Expected Gradient Updates |
|---|---|---|---|
| Hapax (rarest) | Bin 0 | 8.3 × 10⁻⁹ | ~1,660 |
| Near-hapax | Bin 1 | 3.3 × 10⁻⁸ | ~6,640 |
| Uncommon | Bin 2 | 8.3 × 10⁻⁸ | ~16,600 |
| Mid-frequency | Bins 3-6 | ~10⁻⁶ | ~10⁵-10⁶ |
| Common (highest) | Bins 7-9 | 8.3 × 10⁻³ | ~1.66 × 10⁹ |
The rarest tokens get ~1,660 gradient updates over the entire training run. The most common tokens get 1.66 billion. That is a disparity of 10⁶: a million-fold difference in learning signal.
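The last column of the table can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes, as the text does, that the expected number of updates is τ · f_v · BT; the function and constant names are illustrative, not from the paper.

```python
TOTAL_TOKENS = 200e9   # training budget quoted in the text
BATCH_SIZE = 8         # B
SEQ_LEN = 2048         # T


def expected_updates(f_v: float, total_tokens: float = TOTAL_TOKENS) -> float:
    """Expected occurrences of a token, i.e. tau * f_v * B * T."""
    steps = total_tokens / (BATCH_SIZE * SEQ_LEN)   # tau
    return steps * f_v * BATCH_SIZE * SEQ_LEN


rare = expected_updates(8.3e-9)     # rarest bin
common = expected_updates(8.3e-3)   # most common bin
print(f"rarest bin: {rare:,.0f} updates")
print(f"common bin: {common:,.0f} updates")
print(f"disparity:  {common / rare:.1e}x")
```

Because τ · BT is just the total token count, the batch shape cancels out: only the frequency ratio between two tokens determines the disparity.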
How can we see this in a real model? The paper examines the L2 norm of embeddings in a trained LLaMA-Base-1B. Well-trained embeddings converge to high-norm, well-structured vectors. Under-trained embeddings stay low-norm and noisy — close to their random initialization.
The result is stark: mean embedding norm increases monotonically from 0.798 (rarest bin) to 1.549 (most common bin). The rarest tokens have embeddings with roughly half the norm of common tokens (0.798 / 1.549 ≈ 0.52×). Worse, the rare token norm distribution is wide and diffuse (noise-dominated), while common tokens cluster in a narrow peak (well-converged).
And this gap doesn't close with more training. Checkpoints at different points during training show rare token norms actually declining over time while common token norms keep growing. Gradient starvation is not a cold-start artifact — it is permanent.
Each bar shows expected gradient updates for a token frequency bin over 200B tokens of training. Note the logarithmic scale — the disparity is enormous.
You might think: "If rare tokens need more gradient updates, just train on more data." But the paper shows this makes things worse, not better. The norm growth rate per 50B additional tokens is negative for rare tokens and positive for common ones. More training actively widens the gap because the common tokens keep pulling the loss landscape in their direction while rare tokens are noise-dominated and get further marginalized.
Gradient starvation is about training — rare tokens never learn good embeddings. But there is a second, deeper failure mode that afflicts even well-trained tokens when they share syntactic context. The paper calls it Contextual Collapse.
Consider these two sentences:
"The treaty was signed in 1847 after years of negotiation." "The treaty was signed in 1849 after years of negotiation."
The tokens "1847" and "1849" are semantically distinct — they refer to different years, different events. But their syntactic context is identical: both appear after "signed in" and before "after years of." Attention operates on context, so it produces nearly identical outputs for both positions. The hidden states converge.
The paper identifies three canonical categories where contextual collapse occurs:
| Category | Example Pairs | Why They Collapse |
|---|---|---|
| Grammatical homophones | their/there, it's/its, who/whom | Identical syntactic slot, different semantics |
| Numeric identity tokens | 1847/1849, 100/1000 | Numbers in identical templates are context-identical |
| Rare domain tokens | ibuprofen/acetaminophen | Rare + similar context = doubly indistinguishable |
You might think: "The feed-forward network after attention should be able to separate these tokens — it has millions of parameters." The paper proves this is mathematically impossible when the inputs are close enough.
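An FFN is a continuous function with a bounded Lipschitz constant L_FFN. This means:
$$\lVert \mathrm{FFN}(h_u) - \mathrm{FFN}(h_v) \rVert \le L_{\mathrm{FFN}} \cdot \lVert h_u - h_v \rVert$$
If two hidden states are close (||h_u - h_v|| ≤ δ), the FFN outputs can differ by at most L_FFN · δ, whatever weights realize that Lipschitz constant. It's a mathematical ceiling, not a practical one.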
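The ceiling is easy to check numerically on a toy network. The sketch below uses a small random two-layer ReLU network as a stand-in for a transformer FFN (an assumption: real FFNs are gated and far larger) and bounds its Lipschitz constant by the product of Frobenius norms, which is looser than the true constant but always valid because the Frobenius norm upper-bounds the spectral norm and ReLU is 1-Lipschitz.

```python
import math
import random

random.seed(0)


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def frobenius(W):
    return math.sqrt(sum(w * w for row in W for w in row))


def l2(x):
    return math.sqrt(sum(xi * xi for xi in x))


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


d, hidden = 8, 32
W1, W2 = rand_matrix(hidden, d), rand_matrix(d, hidden)


def ffn(h):
    """Toy two-layer ReLU network standing in for a transformer FFN."""
    return matvec(W2, [max(0.0, a) for a in matvec(W1, h)])


# Loose upper bound on the Lipschitz constant of ffn.
L_bound = frobenius(W1) * frobenius(W2)

# Two nearly collapsed hidden states, exactly delta apart.
delta = 1e-3
h_u = [random.gauss(0.0, 1.0) for _ in range(d)]
direction = [random.gauss(0.0, 1.0) for _ in range(d)]
scale = delta / l2(direction)
h_v = [a + scale * b for a, b in zip(h_u, direction)]

out_gap = l2([a - b for a, b in zip(ffn(h_u), ffn(h_v))])
print(f"input gap  = {delta:.1e}")
print(f"output gap = {out_gap:.3e} (ceiling {L_bound * delta:.3e})")
```

However the weights are drawn, the output gap never exceeds the ceiling; shrinking `delta` shrinks the achievable separation with it.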
Now suppose the model needs to produce different outputs for tokens u and v — say, different next-token predictions. Let g be the target function with ||g(u) - g(v)|| = C > 0. The paper proves:
When C > LFFN · δ — when the target separation exceeds what the Lipschitz bound allows — this error is strictly positive. The FFN cannot approximate the target function on both tokens simultaneously, regardless of how many parameters it has or how wide the network is.
Let's walk through the proof step by step. We have collapsed tokens (u, v) with ||hu - hv|| ≤ δ.
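One way to reconstruct the argument is a three-term triangle inequality. The error symbols ε_u = ||FFN(h_u) - g(u)|| and ε_v = ||FFN(h_v) - g(v)|| are notation introduced here for illustration, and the paper's own statement may be phrased differently.

```latex
\begin{align*}
C &= \lVert g(u) - g(v) \rVert \\
  &\le \lVert g(u) - \mathrm{FFN}(h_u) \rVert
     + \lVert \mathrm{FFN}(h_u) - \mathrm{FFN}(h_v) \rVert
     + \lVert \mathrm{FFN}(h_v) - g(v) \rVert \\
  &\le \epsilon_u + L_{\mathrm{FFN}}\,\delta + \epsilon_v ,
\end{align*}
and therefore
\[
\max(\epsilon_u, \epsilon_v) \;\ge\; \frac{\epsilon_u + \epsilon_v}{2}
  \;\ge\; \frac{C - L_{\mathrm{FFN}}\,\delta}{2} .
\]
```

The right-hand side is positive exactly when C > L_FFN · δ, and it contains no network weights other than through L_FFN.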
This is devastating. It says: at least one of the two tokens must be approximated incorrectly, and the error has a strict lower bound that no weight configuration can reduce to zero.
The accompanying visualization plots the simulated L2 distance between hidden states of token pairs across transformer layers, with one collapse pattern per category, following the paper's Figure 2.
The heatmap shows what the paper found empirically in LLaMA-Base-1B: for all three categories, the L2 distance between hidden states stays near zero through most of the network. The tokens are indistinguishable to the FFN — it cannot tell "1847" from "1849" no matter how hard it tries. Only at the very last few layers does any separation emerge, and for numeric tokens, even the final layer shows significant collapse.
We now understand the two failures: gradient starvation starves rare token embeddings, and contextual collapse makes the FFN blind to tokens with similar context. Both stem from the same root cause — the token index is consulted once and discarded.
TIDE's fix is elegantly simple: give every layer persistent access to token identity. Not through the contextualized hidden state, but through a parallel pathway that indexes the token directly, bypassing attention and the FFN entirely.
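TIDE adds three components to the standard transformer: an EmbeddingMemory made of K MemoryBlocks, a depth-conditioned router at each layer, and a null bank that lets the router inject no memory at all.
A standard transformer layer computes:
$$\tilde{h}_l = h_{l-1} + \mathrm{Attn}(\mathrm{RMSNorm}(h_{l-1})), \qquad h_l = \tilde{h}_l + \mathrm{FFN}(\mathrm{RMSNorm}(\tilde{h}_l))$$
A TIDE layer inserts the memory signal inside the FFN's input (after RMSNorm, before the FFN):
$$h_l = \tilde{h}_l + \mathrm{FFN}(\tilde{n}_l + m_l(v))$$
where ñ_l = RMSNorm(h̃_l) is the post-attention normalized hidden state, and v is the original token index at this position.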
Let's trace through a TIDE layer for the token "1847" (index v = 42091) with K = 8 memory blocks, model dimension d = 2048, memory dimension db = 256:
| Step | Operation | Shape |
|---|---|---|
| 1 | Post-attention hidden state h̃l | [B, T, 2048] |
| 2 | Normalize: ñl = RMSNorm(h̃l) | [B, T, 2048] |
| 3 | Router: αl = softmax(Wrl ñl) | Wr: [9, 2048] → α: [B, T, 9] |
| 4 | Memory lookup: Mk(v) = RMSNorm(Ek[v]) | Each: [B, T, 256], already computed |
| 5 | Weighted sum: ml(v) = ∑ αk Mk(v) | [B, T, 256] |
| 6 | FFN input: ñl + ml(v) | [B, T, 2048] (after projection) |
| 7 | hl = h̃l + FFN(ñl + ml(v)) | [B, T, 2048] |
The memory lookup (Step 4) was computed once at the start of the forward pass and is simply indexed — no matrix multiplication, no attention, no gradient through context. Just a table lookup by token index. This is what makes TIDE fundamentally different from retrieval-augmented approaches: the memory is not queried by content, it is indexed by identity.
The EmbeddingMemory is the core of TIDE. It is deceptively simple: K separate embedding tables, each mapping every token index to a learned vector. But the design decisions behind it are subtle and important.
Each MemoryBlock k maintains its own embedding table E_k ∈ R^(|V| × d_b). For a token with index v, the output is:
$$M_k(v) = \mathrm{RMSNorm}(E_k[v])$$
That's it. A table lookup followed by normalization. No matrix multiplications, no attention, no activation functions. Each MemoryBlock is just an embedding table with RMSNorm applied.
You might ask: why not just make one larger embedding table and have the router select parts of it? The answer is gradient flow. Each of the K tables provides an independent gradient pathway into the token's representation. When token v appears in a batch, all K embedding tables receive gradients simultaneously through K different computational paths.
A single table of K · db dimensions would give the token one gradient vector. K separate tables of db dimensions give it K gradient vectors. This is the mechanism behind K-fold gradient amplification, which we will formalize in Chapter 6.
A critical design choice: the K MemoryBlocks share no parameters with each other or with the primary embedding table E. This is deliberate. The paper verifies empirically (Figure 9 in the paper) that after training, the cosine distance between the primary embedding E and each MemoryBlock Mk ranges from 0.65 to 0.99 — they are highly distinct. The blocks don't degenerate into copies of E; they learn complementary token-identity signals.
| Property | Primary Embedding E | MemoryBlock Mk |
|---|---|---|
| Dimension | d (model hidden dim, e.g., 2048) | db (smaller, e.g., 256) |
| Used at | Layer 0 only | Every layer (via router) |
| Input to | Residual stream h(0) | FFN input (additive) |
| Gradient source | Single path through residual stream | K independent paths through K blocks |
| Context-dependent? | No (pure lookup) | No (pure lookup, router decides weight) |
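The memory tensor is computed once at the start of each forward pass. For each position with token index v it stacks the K block outputs:
$$M(v) = \left[ M_1(v), \ldots, M_K(v) \right] \in \mathbb{R}^{K \times d_b}$$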
This tensor is then indexed at every layer — no recomputation. Each layer only needs to compute the router weights αl (a cheap linear projection + softmax) and take a weighted sum of the pre-computed memory vectors. The memory lookup itself is O(1) — identical cost to the original embedding lookup at Layer 0.
This heatmap shows mean cosine distance between embedding spaces (primary E and 8 MemoryBlocks) from a trained TIDE-8E-1B model. High values (brighter) mean more distinct representations.
The heatmap reveals two important findings. First, every MemoryBlock is highly distant from the primary embedding E (top row / left column are all bright), confirming the blocks learn genuinely new information rather than copying E. Second, inter-block distances are somewhat lower (the interior is slightly dimmer), suggesting the blocks converge to overlapping but non-collapsed subspaces — they are diverse enough to be useful but similar enough that the router can smoothly interpolate between them.
```python
# Sketch of the EmbeddingMemory forward pass (assumes PyTorch >= 2.4 for nn.RMSNorm)
import torch
import torch.nn as nn


class MemoryBlock(nn.Module):
    """One token-indexed table: embedding lookup followed by RMSNorm."""

    def __init__(self, vocab_size, d_b):
        super().__init__()  # required before registering submodules
        self.embed = nn.Embedding(vocab_size, d_b)
        self.norm = nn.RMSNorm(d_b)

    def forward(self, token_ids):                  # [B, T]
        return self.norm(self.embed(token_ids))    # [B, T, d_b]


class EmbeddingMemory(nn.Module):
    def __init__(self, K, vocab_size, d_b):
        super().__init__()
        self.blocks = nn.ModuleList(
            [MemoryBlock(vocab_size, d_b) for _ in range(K)]
        )

    def forward(self, token_ids):                  # [B, T]
        # Computed ONCE per forward pass, shared across all layers
        return torch.stack(
            [block(token_ids) for block in self.blocks], dim=2
        )                                          # [B, T, K, d_b]
```
The EmbeddingMemory provides K different views of each token's identity. But which views should each layer use? A deep layer processing abstract semantics might need different token-identity signals than an early layer processing surface syntax. This is the job of the depth-conditioned router.
At each layer l, the router takes the post-attention normalized hidden state and projects it to K+1 logits:
$$\alpha^{l} = \mathrm{softmax}\left(W_r^{l}\, \tilde{n}_l\right) \in \mathbb{R}^{K+1}$$
where W_r^l ∈ R^((K+1) × d) is a per-layer weight matrix. The softmax ensures α_k^l > 0 and ∑_k α_k^l = 1. This is about the cheapest possible routing mechanism: a single matrix multiply followed by softmax. No MLP, no gating, no top-k selection.
The (K+1)-th slot is special. Its MemoryBlock output is always zero: M_{K+1}(v) = 0 for all tokens v. This is the null bank. It has no parameters; it is just the zero vector.
Why include it? Because the router uses softmax, it must assign all its probability mass somewhere. Without the null bank, the router is forced to always inject some memory signal, even when the contextual residual stream is sufficient. The null bank gives the router a way to say "no memory needed here" by assigning weight to the zero slot.
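A small numeric sketch makes the point concrete. The logit values below are illustrative, not taken from the paper: the K block logits are set equal, and a single extra logit plays the role of the null bank.

```python
import math


def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


K = 8
block_logits = [0.0] * K                 # router indifferent among the K blocks

# Without a null slot the K weights must sum to 1: memory is always injected.
no_null = softmax(block_logits)
active_mass_without_null = sum(no_null)

# With a null slot (last entry), a larger null logit drains weight from the blocks.
null_logit = 3.0
with_null = softmax(block_logits + [null_logit])
active_mass_with_null = sum(with_null[:K])

print(f"active mass without null bank: {active_mass_without_null:.3f}")
print(f"active mass with null bank:    {active_mass_with_null:.3f}")
```

Since the null slot contributes the zero vector, whatever weight it absorbs simply scales down the injected memory signal.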
The paper analyzes the trained router weights in TIDE-8E-1B and finds striking patterns. For the last layer, stratified by token frequency:
| Token Frequency | Null Bank Weight αnull | Active Memory Weight 1 - αnull |
|---|---|---|
| Rarest decile (0-10%) | 0.530 | 0.470 |
| 10-20% | 0.529 | 0.471 |
| 30-40% | 0.709 | 0.291 |
| 50-60% | 0.765 | 0.235 |
| 70-80% | 0.785 | 0.215 |
| Most common (90-100%) | 0.889 | 0.111 |
This is remarkable. The router has learned, without any explicit frequency signal, to open the gate wide for rare tokens (47% active memory) and nearly close it for common tokens (only 11% active memory). The null bank weight rises with token frequency almost monotonically (the two rarest deciles are essentially tied at 0.530 and 0.529), which is what the theory predicts. Common tokens have well-trained embeddings and don't need the memory pathway. Rare tokens desperately need it.
Even more interesting: among the active (non-null) memory blocks, the router weights are non-uniform. In TIDE-8E-1B, block M5 carries an outsized share for rare tokens (α5 ≈ 0.31 on the rarest decile) while being nearly zero for common tokens. Block M2 specializes for mid-frequency tokens. The blocks don't redundantly co-fire — they specialize for different frequency regimes.
Heatmap shows router weight assigned to active memory blocks (left) and null bank (right) across token frequency bins. Rare tokens (top rows) receive more memory; common tokens (bottom) are mostly routed to null.
```python
# Router for a single TIDE layer
import torch
import torch.nn as nn
import torch.nn.functional as F


class TIDERouter(nn.Module):
    def __init__(self, d_model, K):
        super().__init__()  # required before registering submodules
        self.proj = nn.Linear(d_model, K + 1)    # K blocks + null bank

    def forward(self, normed_hidden, memory_tensor):
        # normed_hidden: [B, T, d_model]
        # memory_tensor: [B, T, K, d_b] (precomputed)
        logits = self.proj(normed_hidden)         # [B, T, K+1]
        alpha = F.softmax(logits, dim=-1)         # [B, T, K+1]

        # Null bank: append a zero vector as the (K+1)-th slot
        B, T, K, d_b = memory_tensor.shape
        null = torch.zeros(B, T, 1, d_b, device=memory_tensor.device,
                           dtype=memory_tensor.dtype)
        memory_with_null = torch.cat([memory_tensor, null], dim=2)  # [B, T, K+1, d_b]

        # Weighted sum across the K+1 slots
        m = (alpha.unsqueeze(-1) * memory_with_null).sum(dim=2)     # [B, T, d_b]
        return m  # injected into the FFN input
```
TIDE makes three precise theoretical claims. Let's prove each one, step by step, with no hand-waving.
TIDE can approximate a standard transformer to arbitrary precision. In other words, TIDE is at least as expressive as the baseline.
Proposition 3.1. For any ε > 0, there exist finite router parameters W_r^l such that ||m_l(v)|| < ε for all tokens v and all layers l.
The proof relies entirely on the null bank. Let's work through it.
Let's verify with numbers. Suppose K = 8, C = 1.0 (RMSNorm outputs have unit norm), and we want ε = 0.001:
Setting the null bank logit to about 9.0 is enough to make the memory contribution negligible. This is a perfectly achievable parameter configuration.
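The figure of about 9.0 can be checked directly. This sketch rests on simplifying assumptions that are not spelled out in the text: all K block logits are 0, the null bank logit is a constant z, and every block output has norm at most C, so that ||m_l(v)|| ≤ C · K / (K + e^z). The function names are illustrative.

```python
import math


def min_null_logit(K: int, C: float, eps: float) -> float:
    """Null-bank logit z at which C * K / (K + e^z) equals eps (block logits at 0)."""
    return math.log(K * (C / eps - 1.0))


def memory_norm_bound(K: int, C: float, z: float) -> float:
    """Upper bound on ||m_l(v)||: C times the total weight on the K active blocks."""
    return C * K / (K + math.exp(z))


K, C, eps = 8, 1.0, 1e-3
z_min = min_null_logit(K, C, eps)
print(f"threshold null logit: {z_min:.3f}")
print(f"bound at z = 9.0:     {memory_norm_bound(K, C, 9.0):.2e}")
```

Any logit above the threshold pushes the memory contribution below ε, and the threshold grows only logarithmically as ε shrinks.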
This is the core quantitative result. TIDE amplifies the gradient signal for every token by a factor of K.
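Proposition 3.2. Under minibatch SGD, the total expected cumulative squared gradient norm across all K embedding tables for token v satisfies:
$$\sum_{k=1}^{K} \mathbb{E}\left[\sum_{t=1}^{\tau} \lVert \nabla_{E_k[v]} \mathcal{L}_t \rVert^2\right] \le K \cdot \tau \cdot \kappa_v \cdot G^2$$
where κ_v = 1 - (1 - f_v)^(BT) ≈ f_v · BT for small f_v.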
Let's derive this carefully.
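A sketch of one possible derivation follows. It is a reconstruction under stated assumptions, not the paper's proof: the BT tokens in a batch are treated as independent draws from the unigram distribution, a row E_k[v] receives zero gradient at step t when v is absent from the batch, and its squared gradient norm is at most G² when v is present.

```latex
\begin{align*}
\Pr[v \in \text{batch}_t] &= 1 - (1 - f_v)^{BT} = \kappa_v , \\
\mathbb{E}\!\left[\lVert \nabla_{E_k[v]} \mathcal{L}_t \rVert^2\right]
  &\le \kappa_v\, G^2
  && \text{(one table, one step)} \\
\sum_{t=1}^{\tau} \mathbb{E}\!\left[\lVert \nabla_{E_k[v]} \mathcal{L}_t \rVert^2\right]
  &\le \tau\, \kappa_v\, G^2
  && \text{(one table, } \tau \text{ steps)} \\
\sum_{k=1}^{K} \sum_{t=1}^{\tau}
  \mathbb{E}\!\left[\lVert \nabla_{E_k[v]} \mathcal{L}_t \rVert^2\right]
  &\le K\, \tau\, \kappa_v\, G^2
  \;\approx\; K\, \tau\, f_v\, BT\, G^2
  && \text{(all } K \text{ tables)}
\end{align*}
```

The final approximation uses κ_v ≈ f_v · BT, which holds when f_v · BT is much smaller than 1.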
Compare this to the standard transformer's bound: τ · fv · BT · G2 (a single gradient pathway). TIDE provides K times more gradient signal through K independent pathways.
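The comparison can be made numeric. This is a minimal sketch with G² set to 1 and the rarest-bin frequency from the earlier table; the function names are illustrative, and κ_v is computed with `expm1` and `log1p` so that tiny frequencies do not lose precision.

```python
import math


def kappa(f_v: float, batch_tokens: int) -> float:
    """P(token appears at least once among batch_tokens independent draws)."""
    # 1 - (1 - f_v)^BT, evaluated stably for very small f_v
    return -math.expm1(batch_tokens * math.log1p(-f_v))


def cumulative_bound(f_v, K, steps, batch_tokens, G2=1.0):
    """K * tau * kappa_v * G^2; K = 1 corresponds to the single-pathway baseline."""
    return K * steps * kappa(f_v, batch_tokens) * G2


BT = 8 * 2048                 # tokens per batch
steps = 200e9 / BT            # tau for a 200B-token run
f_rare = 8.3e-9               # rarest-bin frequency

baseline = cumulative_bound(f_rare, K=1, steps=steps, batch_tokens=BT)
tide_8 = cumulative_bound(f_rare, K=8, steps=steps, batch_tokens=BT)
print(f"kappa_v / (f_v * BT) = {kappa(f_rare, BT) / (f_rare * BT):.6f}")
print(f"TIDE-8 / baseline    = {tide_8 / baseline:.1f}")
```

For rare tokens the linear approximation κ_v ≈ f_v · BT is essentially exact, so the bound scales with K and nothing else.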
Proposition 3.3. For a collapsed token pair (u, v) with ||hu - hv|| ≤ δ, the EmbeddingMemory can achieve any target separation C > 0 regardless of δ and LFFN.
This follows from a simple but powerful observation. The memory output Mk(v) = RMSNorm(Ek[v]) depends on the discrete token index v, not on the hidden state h. The hidden state collapses? The memory doesn't care — it looks up v directly.
The embedding rows Ek[u] and Ek[v] are separate, uncoupled parameters. They can be set to any values independently, so the RMSNorm outputs can achieve any prescribed separation. This is fundamentally different from the FFN, which must map from continuous hidden states and is therefore subject to the Lipschitz bound.
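The following toy example illustrates the idea with a fully collapsed pair. The token ids, the four-dimensional rows, and the hidden state are all made-up values for illustration; `rms_norm` is a plain-Python stand-in for RMSNorm without a learned gain.

```python
import math


def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [xi / rms for xi in x]


# Toy memory table E_k: one independently trainable row per token id.
E_k = {
    1847: [1.0, 0.0, 0.0, 0.0],
    1849: [0.0, 1.0, 0.0, 0.0],
}

# Worst case for the FFN: the two hidden states are identical (delta = 0).
h_u = [0.3, -0.2, 0.5, 0.1]
h_v = list(h_u)

hidden_gap = math.dist(h_u, h_v)
memory_gap = math.dist(rms_norm(E_k[1847]), rms_norm(E_k[1849]))
print(f"hidden-state gap: {hidden_gap:.3f}")
print(f"memory gap:       {memory_gap:.3f}")
```

The memory gap depends only on the two table rows, so it stays the same however close the hidden states become.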
The accompanying chart compares cumulative gradient signal for a rare token (f_v = 10⁻⁸) in a standard transformer (1 pathway) against TIDE with K memory blocks, for varying K.
Theory gives us guarantees. Now let's see what TIDE actually delivers on real benchmarks at real scale.
The paper's most striking result is Figure 5: mean validation cross-entropy loss per frequency decile for LLaMA-Base-1B vs TIDE-8E-1B, both trained on 200B tokens.
TIDE improves on every single decile, but the gains are sharply asymmetric:
| Frequency Decile | Loss Reduction (nats) | Relative Improvement |
|---|---|---|
| 0-10% (rarest) | 0.704 | 9.0% |
| 10-20% | 0.507 | 6.5% |
| 20-30% | 0.301 | 5.2% |
| 30-40% | 0.194 | 4.2% |
| 40-50% | 0.138 | 3.0% |
| 50-60% | 0.135 | 3.1% |
| 60-70% | 0.125 | 3.1% |
| 70-80% | 0.122 | 3.2% |
| 80-90% | 0.118 | 2.6% |
| 90-100% (most common) | 0.068 | 2.4% |
The rare-to-common improvement ratio is 0.704/0.068 = 10.4×. TIDE helps rare tokens roughly ten times more than common tokens. This is exactly the signature of K-fold gradient amplification — the tokens that were most gradient-starved benefit the most from the additional gradient pathways.
On standard language modeling benchmarks with the 1B-scale model family:
| Model | WikiText-2 PPL ↓ | PubMed PPL ↓ | DCLM PPL ↓ |
|---|---|---|---|
| LLaMA-Base-1B | ~13.0 | ~15.0 | ~21.0 |
| TIDE-2E-1B | ~12.5 | ~14.3 | ~20.0 |
| TIDE-8E-1B | ~11.8 | ~13.5 | ~18.5 |
| TIDE-16E-1B | ~11.5 | ~13.2 | ~18.0 |
| TIDE-24E-1B | ~11.2 | ~12.8 | ~17.5 |
Improvement is monotonic in K across the tested range, with no saturation visible up to K = 24. The gains are substantial: TIDE-24E-1B reduces WikiText-2 perplexity by ~14% relative to the baseline.
A remarkable finding: TIDE with just 2-4 MemoryBlocks at 100B training tokens matches the perplexity that the baseline reaches at 200B tokens. The additional gradient pathways translate directly to faster effective convergence — TIDE learns more from each training step.
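Zero-shot accuracy at the 1B scale (six of the eight benchmarks are shown; the Average column appears to cover all eight, since it does not equal the mean of the six displayed columns):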
| Model | ARC-C | ARC-E | BoolQ | HellaSwag | LAMBADA | PIQA | Average |
|---|---|---|---|---|---|---|---|
| LLaMA-Base-1B | 37.5 | 64.4 | 61.7 | 63.9 | 64.6 | 74.9 | 61.4 |
| TIDE-8E-1B | 37.5 | 64.5 | 69.3 | 65.3 | 64.7 | 75.5 | 63.0 |
| TIDE-24E-1B | 38.9 | 66.3 | 69.5 | 66.3 | 66.4 | 77.3 | 63.7 |
The average improves from 61.4 to 63.7 (+2.3 points absolute). Notable individual gains include BoolQ (+7.8 points), HellaSwag (+2.4), and PIQA (+2.4). The improvements scale consistently from 750M to 3B parameters, confirming TIDE is not a small-model trick.
The paper revisits the three collapse categories from Chapter 2 and compares L2 separation between base and TIDE models. Across all categories, TIDE increases layer-wise separation, with the largest gains in the middle-to-terminal layers (where collapse is most severe). Numeric tokens — the worst collapse category — are the biggest beneficiary, seeing +26.1 mean L2 improvement.
The accompanying bar chart compares cross-entropy loss per frequency decile, both as absolute loss and as improvement delta.
The paper decomposes held-out cross-entropy by token frequency as K increases from 0 (baseline) to 24. The rare-token loss slope is 3.7× steeper than the common-token slope — each additional MemoryBlock benefits rare tokens almost four times more than common tokens. Even K = 2 captures ~55% of the total rare-token improvement at K = 24, suggesting the bulk of the benefit is achievable with modest overhead.
TIDE adds K embedding tables, each of size |V| × d_b. With |V| = 128,256 tokens and d_b = 256, each MemoryBlock is 128,256 × 256 ≈ 32.8 million parameters. At K = 8, that's about 263M extra parameters. At K = 24, it's about 788M. How do we keep this manageable?
During inference, the EmbeddingMemory tables are read-only lookup tables. They are indexed by token identity — a discrete integer — and never modified. This means they have the same properties as the primary embedding table E: they are static, their access pattern is known in advance (determined by the input tokens), and they can be heavily compressed.
The paper demonstrates that the EmbeddingMemory tables can be quantized to 4-bit precision with negligible performance impact. Why does this work so well?
Each MemoryBlock output passes through RMSNorm, which normalizes the vector to approximately unit norm. The downstream computation is a weighted sum followed by addition to the FFN input. The model is robust to small perturbations in this additive signal — the FFN was trained to work with approximate memory vectors, not exact ones.
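To give a feel for how small the perturbation is, here is a generic symmetric per-row 4-bit quantizer. It is an assumption for illustration only: the text does not describe the paper's actual quantization scheme, and the row values are made up.

```python
def quantize_4bit(row):
    """Symmetric per-row quantization to signed integer codes in [-7, 7]."""
    max_abs = max(abs(x) for x in row) or 1.0   # avoid a zero scale for all-zero rows
    scale = max_abs / 7.0
    codes = [max(-7, min(7, round(x / scale))) for x in row]
    return codes, scale


def dequantize_4bit(codes, scale):
    return [c * scale for c in codes]


row = [0.82, -0.31, 0.05, -1.40, 0.66, 0.0, 1.12, -0.97]   # illustrative values
codes, scale = quantize_4bit(row)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"codes:     {codes}")
print(f"max error: {max_error:.3f} (half a step is {scale / 2:.3f})")
```

The round-trip error of any element is at most half a quantization step, which is the kind of small additive perturbation the paragraph above argues the FFN tolerates.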
At 4-bit precision, each parameter takes half a byte, so one MemoryBlock of 32.8M parameters occupies about 16.4 MB.
For K = 8: 131 MB total. For K = 24: 394 MB total. These are small numbers.
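These storage figures follow from the vocabulary size and memory dimension used in this walkthrough. A short sketch, with illustrative function names and decimal megabytes:

```python
VOCAB_SIZE = 128_256   # |V| from the text
D_B = 256              # memory dimension d_b from the text


def memory_params(K: int, vocab_size: int = VOCAB_SIZE, d_b: int = D_B) -> int:
    """Total parameters across K MemoryBlock tables."""
    return K * vocab_size * d_b


def storage_mb(K: int, bits: int = 4) -> float:
    """Storage in decimal megabytes at the given bit width."""
    return memory_params(K) * bits / 8 / 1e6


for K in (1, 8, 24):
    print(f"K={K:>2}: {memory_params(K) / 1e6:6.1f}M params, "
          f"{storage_mb(K):6.1f} MB at 4-bit")
```

Storage grows linearly in K, d_b, and |V|, so doubling any one of them doubles the footprint.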
The memory tables can be offloaded to SSD (solid-state drive) rather than kept in VRAM. Since the access pattern is determined by the input token sequence (known before the forward pass begins), the tables can be prefetched asynchronously. The latency of SSD access (~100-500 μs) is hidden behind the GPU computation of attention and FFN.
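The paper provides a detailed breakdown for each TIDE variant. These are the paper's reported configurations, and their SSD figures do not match the d_b = 256 running example above (for instance, 1.051 GB versus 131 MB at K = 8):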
| Model | VRAM Params (8-bit) | SSD Params (4-bit) | Total Params |
|---|---|---|---|
| LLaMA-Base-1B | 1.028 GB | 0 GB | 1.03B |
| TIDE-2E-1B | 1.028 GB | 0.263 GB | 1.05B |
| TIDE-4E-1B | 1.028 GB | 0.525 GB | 1.05B |
| TIDE-8E-1B | 1.028 GB | 1.051 GB | 1.03B + 0.53B mem |
| TIDE-16E-1B | 1.028 GB | 2.101 GB | 1.03B + 1.05B mem |
| TIDE-24E-1B | 1.028 GB | 3.152 GB | 1.03B + 1.58B mem |
The critical column is VRAM: it stays at 1.028 GB for every TIDE variant, identical to the baseline. The only cost is SSD storage, which is cheap. The VRAM footprint — the actual constraint for GPU deployment — is unchanged.
The accompanying chart shows the VRAM vs SSD breakdown as K (the number of MemoryBlocks) varies. VRAM stays constant while SSD scales linearly with K.
Beyond memory, what about compute? Each TIDE layer adds:
| Operation | Cost | Relative to FFN |
|---|---|---|
| Router projection | (K+1) × d multiply | Negligible (~0.1%) |
| Softmax over K+1 | O(K) per token | Negligible |
| Weighted sum of K vectors | K × db multiply-add | Negligible (~0.5%) |
| Memory lookup | K table lookups (prefetched) | Hidden behind GPU compute |
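Setting aside the optional d_b → d projection, the additional compute per token per layer is dominated by the router projection: (K+1) × d multiply-accumulates. For K = 8, d = 2048, that's 9 × 2048 = 18,432. Compare to the FFN's ~2 × d × 4d = 2 × 2048 × 8192 ≈ 33.6M. The router is less than 0.06% of the FFN cost. When the d_b → d projection is needed, it adds d_b × d = 256 × 2048 = 524,288 multiply-accumulates, about 1.6% of the FFN.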
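The ratio is simple to recompute for other model sizes. The sketch below uses the same rough FFN estimate as the text (two d × 4d projections, ignoring the gate projection of a gated FFN); the function names are illustrative.

```python
def router_macs(K: int, d: int) -> int:
    """Multiply-accumulates for the (K+1) x d router projection, per token."""
    return (K + 1) * d


def ffn_macs_estimate(d: int, expansion: int = 4) -> int:
    """Rough per-token FFN cost: two projections between d and expansion * d."""
    return 2 * d * expansion * d


K, d = 8, 2048
ratio = router_macs(K, d) / ffn_macs_estimate(d)
print(f"router: {router_macs(K, d):,} MACs")
print(f"FFN:    {ffn_macs_estimate(d):,} MACs")
print(f"ratio:  {ratio:.4%}")
```

Because the router cost is linear in d while the FFN cost is quadratic, the ratio shrinks further as the model gets wider.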
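```python
# TIDE layer sketch (assumes PyTorch >= 2.4 for nn.RMSNorm)
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiLUGatedFFN(nn.Module):
    """Gated SiLU FFN; the 4x hidden width is an illustrative assumption."""

    def __init__(self, d_model, d_hidden=None):
        super().__init__()
        d_hidden = d_hidden or 4 * d_model
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class TIDELayer(nn.Module):
    def __init__(self, d_model, d_b, K, n_heads):
        super().__init__()
        # Stand-in attention module (no RoPE); swap in the model's own attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = SiLUGatedFFN(d_model)
        self.norm1 = nn.RMSNorm(d_model)
        self.norm2 = nn.RMSNorm(d_model)
        self.router = nn.Linear(d_model, K + 1)   # tiny
        self.mem_proj = nn.Linear(d_b, d_model)   # needed when d_b != d_model

    def forward(self, h, memory):
        # h: [B, T, d_model]; memory: [B, T, K, d_b], precomputed once per pass
        # Attention block (unchanged from a standard layer), with a causal mask
        x = self.norm1(h)
        T = h.shape[1]
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        h_tilde = h + attn_out

        # Router: which memory blocks to use at this depth?
        n_tilde = self.norm2(h_tilde)                        # [B, T, d_model]
        alpha = F.softmax(self.router(n_tilde), dim=-1)      # [B, T, K+1]

        # Memory injection, with the null bank as an appended zero slot
        B, _, K, d_b = memory.shape
        null = memory.new_zeros(B, T, 1, d_b)
        mem_null = torch.cat([memory, null], dim=2)          # [B, T, K+1, d_b]
        m = (alpha.unsqueeze(-1) * mem_null).sum(dim=2)      # [B, T, d_b]
        m = self.mem_proj(m)                                 # [B, T, d_model]

        # FFN with memory-augmented input
        return h_tilde + self.ffn(n_tilde + m)
```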
TIDE doesn't exist in isolation. It connects to several important threads in the LLM landscape. Understanding these connections helps you see where the field is heading.
The closest analog to TIDE is Rotary Position Embeddings (RoPE). RoPE re-injects position information at every attention layer by rotating the query and key vectors. Without RoPE, the model would need to infer position from the residual stream alone — and it would suffer the same kind of "positional collapse" that TIDE fixes for token identity.
The parallel is exact:
| Property | RoPE (Position) | TIDE (Token Identity) |
|---|---|---|
| What is injected | Position index → rotation matrix | Token index → memory vector |
| Where | Every attention layer (Q, K) | Every FFN input |
| How | Multiplicative rotation | Additive fusion |
| Without it | Position lost in deep layers | Token identity lost in deep layers |
| Discrete input? | Yes (position integer) | Yes (token index integer) |
A series of papers (Geva et al. 2021, 2022; Dai et al. 2022; Meng et al. 2022) established that FFN layers in transformers function as key-value memories. Within each FFN, the first projection acts as a pattern detector (keys) and the second writes the associated information into the residual stream (values). Specific neurons were identified as "knowledge neurons" storing individual facts.
TIDE addresses a limitation of this paradigm: FFN-as-memory is indexed by the continuous hidden state, which means it inherits the Lipschitz bottleneck. When two tokens produce similar hidden states, the FFN retrieves similar "memories" for both, even if they need different factual information. TIDE's EmbeddingMemory is indexed by discrete token identity, bypassing this entirely.
The lineage traces back to Memory Networks (Weston et al. 2014), End-to-End Memory Networks (Sukhbaatar et al. 2015), and Neural Turing Machines (Graves et al. 2014, 2016). These augment neural networks with external read-write memory banks. Product-key networks (Lample et al. 2019) improved scaling with efficient memory retrieval.
TIDE differs from all of these in a crucial way: its memory is not queried by content but by identity. Traditional memory networks use an attention-like mechanism to match queries against memory keys. TIDE simply looks up the token index — no matching, no scoring, no key-value attention. This makes TIDE O(1) per lookup rather than O(memory_size).
RAG systems (Lewis et al. 2020; Borgeaud et al. 2022; Izacard et al. 2023) augment LLMs with external knowledge retrieved at inference time. TIDE is fundamentally different: its memory is internal (part of the model parameters), trained end-to-end (not a separate retrieval system), and indexed by token identity (not by semantic similarity).
However, TIDE and RAG are complementary. RAG provides external, updateable knowledge. TIDE provides persistent token-level identity. A TIDE model with RAG would have both a strong internal memory (MemoryBlocks) and access to external corpora.
Sparse autoencoders (SAEs) decompose transformer activations into interpretable features. TIDE's MemoryBlocks may be more directly interpretable than FFN activations because each block provides a clean, token-indexed vector that can be examined in isolation. The paper's analysis of cosine distances between blocks suggests each block encodes a distinct "aspect" of token identity — an intriguing direction for mechanistic interpretability.
TIDE builds on several concurrent works. MoLE (Jie et al. 2025) showed that in mixture-of-experts models, most experts can be trained directly with token-level embeddings. MemoryLLM (Jaiswal et al. 2026) completely decouples FFNs from the contextual residual stream by training layer-local, token-indexed embedding tables for interpretability. STEM (Sadhukhan et al. 2026) partially replaces FFN up-projections with embedding table lookups. TIDE unifies and extends these ideas with its global EmbeddingMemory shared across all layers, depth-conditioned routing, and null bank.
| Limitation | Potential Direction |
|---|---|
| Memory tables grow linearly with vocab size | Structured embeddings (product quantization, LSH) |
| Tested up to 3B scale | Larger-scale experiments (7B, 70B) may reveal saturation |
| Fixed K across all layers | Adaptive K that varies by depth |
| Static memory at inference | Online memory updates for continual learning |
| Trains from scratch only | Retrofit into existing pretrained models |