Santoro, Faulkner, Raposo, Rae, Chrzanowski, Weber, Wierstra, Vinyals, Pascanu & Lillicrap — DeepMind, NeurIPS 2018 • arXiv:1806.01822

Relational Recurrent
Neural Networks

LSTMs can remember. But can they reason about how their memories relate to each other? This paper introduced the Relational Memory Core — replacing LSTM gates with multi-head attention over memory slots — so that memories interact at every time step. The result: a recurrent network that can sort, compare, and relate things it saw at different moments in time.

Prerequisites: RNNs & LSTMs + Basic attention (we build the rest from scratch)
10
Chapters
1
Simulation
10
Quizzes

Chapter 0: The Problem

You are watching eight numbered balls land on a table, one at a time. After the last one lands, someone asks: "Which ball is the 3rd farthest from ball 5?"

To answer, you need to recall every ball's position, compute the distance from ball 5 to each of the others, sort those distances, and pick the 3rd largest. This is not just memory — it is relational reasoning over memories. You must compare things you saw at different points in time and reason about how they relate to each other.

An LSTM can remember things. It packs all information into a single hidden state vector — a long list of numbers that gets updated at every time step. But that single vector is a compressed soup. There is no structure that separates "ball 1's position" from "ball 4's position." When you need to compute the distance between two specific balls, you have to untangle them from the soup. In practice, LSTMs fail catastrophically at this task — stuck below 30% accuracy.

The core gap: Standard recurrent networks can store information and retrieve it, but they have no built-in mechanism to compare stored memories with each other. Memory storage and memory reasoning are different abilities, and current architectures only optimize for the first.

Memory-augmented networks like the DNC (Differentiable Neural Computer) try to fix this by giving the network an external memory matrix with separate slots. That helps with storage — each slot can hold a different piece of information. But DNCs read and write slots independently. They do not have a mechanism for slot 3 to directly ask "how do I relate to slot 7?" The memories sit in isolation.

Santoro et al. proposed a simple fix: at every time step, let all the memories attend to each other using multi-head dot-product attention — the same mechanism from the Transformer. They called the resulting module a Relational Memory Core (RMC). The key idea is that memory interaction is just as important as memory storage. If your memories can talk to each other, they can reason about relationships.

The RMC solved the ball-sorting task at 91% accuracy where LSTMs and DNCs could not break 30%. It also achieved state-of-the-art results on WikiText-103 language modeling, program evaluation, and reinforcement learning tasks. The inductive bias of memory-memory interaction turned out to be broadly useful — not just for toy relational tasks, but for the messy, large-scale problems where we actually need better models.

In the chapters that follow, we will build the RMC piece by piece: the memory matrix, the attention mechanism that lets memories interact, the way new inputs are encoded, and the LSTM gating that provides temporal persistence. Each piece is simple. Their combination is powerful.

Why do standard LSTMs struggle with relational reasoning over sequential information?

Chapter 1: Memory as a Matrix

The first design decision is deceptively simple: instead of storing memory as a single vector, store it as a matrix.

Let M be an N × F matrix, where N is the number of memory slots and F is the size of each slot. Each row mi of M is one memory slot — a vector of F numbers that can hold a distinct piece of information. Think of it as a notebook with N pages, each page holding F words.

M = [m1; m2; …; mN] ∈ ℝN×F

This is not a new idea by itself. DNCs, Neural Turing Machines, and Entity Networks all use memory matrices. What makes the RMC different is what happens next — how the rows interact.

Why does splitting memory into slots help? Consider the ball-sorting task. With N = 8 slots, the network can learn to write each ball's information into a separate slot. Ball 3's position lives in m3, ball 7's position lives in m7. When you need to compare them, you know exactly where to look. With a single LSTM vector, all eight balls are blended together from the start.

This is the same insight behind databases: structured storage enables structured queries. A spreadsheet with each item in its own row is fundamentally more queryable than a paragraph of prose containing the same information. The memory matrix gives the RMC's attention mechanism something structured to operate on.

Compartmentalization is the first step. You cannot reason about the relationship between two things if those two things are tangled into one vector. Separate storage enables separate access, which enables comparison. The matrix structure provides the potential for relational reasoning — the attention mechanism (next chapter) provides the mechanism.

The total memory capacity is N × F — the total number of elements in M. The authors found an important trade-off: some tasks need more slots (each holding less), while others need fewer, larger slots. The Nth Farthest task benefits from 8 or 16 small memories (one per input vector). Language modeling works best with just 1 or 2 large memories. The number of parameters is proportional to F (the slot size), not N (the number of slots), because all slots share the same projection weights. You can add more slots without adding more parameters.

If an RMC has 4 memory slots of size 512, what is the total memory capacity, and what determines the parameter count?

Chapter 2: Attention Between Memories

Here is the core idea of the paper, and it comes straight from the Transformer. Given memory matrix M, we compute multi-head dot-product attention (MHDPA) where each memory slot attends to every other memory slot.

Step 1: project each memory row into queries, keys, and values.

Q = MWq,   K = MWk,   V = MWv

Each row mi of M gets mapped to a query qi, a key ki, and a value vi. The query asks "what information am I looking for?", the key advertises "what information do I contain?", and the value carries "what information should I share?"

Step 2: compute attention weights and the weighted average.

Aθ(M) = softmax(QKT / √dk) · V

The dot product Q KT computes a similarity score between every pair of memory slots. The softmax normalizes each row of scores into weights that sum to 1. The multiplication by V produces an updated memory where each slot is a weighted blend of information from all other slots.

Memory-memory interaction in one equation. After this operation, m3 now contains information from m1, m5, m7 — whichever slots it attended to. The network learns which memories to relate through the projection weights Wq, Wk, Wv. This is the relational reasoning step: every memory explicitly considers every other memory.

The "multi-head" part means we repeat this with h different sets of projection weights (Wq1, Wk1, Wv1), (Wq2, Wk2, Wv2), and so on. Each head produces an N × (F/h) output, and we concatenate them column-wise to get back an N × F matrix. Different heads can learn to share different kinds of information between memories. One head might relate positions. Another might relate identities.

After the attention, a row-wise MLP with layer normalization is applied to each memory slot independently. This is the same post-attention processing used in Transformers. It gives each memory a chance to non-linearly transform the information it gathered from other slots. The authors found that the number of MLP layers matters — for language modeling, a 5-layer MLP worked best, suggesting that significant per-slot computation is needed after the relational step.

The paper also allows multiple "blocks" of attention per time step. One block means each memory attends to all others once. Two blocks means the output of the first round of attention becomes the input to a second round. More blocks enable higher-order interactions: in block 1, slot A learns about slot B; in block 2, slot A can learn about "what slot C learned from slot B." In practice, 1 block was sufficient for most tasks.

There is an important contrast with how attention is used in other architectures. In seq2seq models and Transformers, attention operates across time — the query at time t attends to keys from times 1 through t-1. In the RMC, attention operates within a single time step, across memory slots. It is spatial, not temporal. The recurrence (next chapters) handles the temporal dimension.

In the RMC, what does multi-head dot-product attention operate over?

Chapter 3: Encoding New Inputs

So far we have memories interacting with each other, but how does new information enter the memory? The paper handles this with an elegant trick: row-concatenate the input with the memory matrix before computing keys and values, but use only the memory for queries.

M̃ = softmax( MWq · ([M;x]Wk)T / √dk ) · [M;x]Wv

Here [M; x] means stacking the input vector x as an extra row beneath M. If M has N rows, then [M; x] has N + 1 rows. The keys and values come from N + 1 sources (all memories plus the new input), but the queries come from only N sources (the memories). So the output M̃ has the same N × F shape as M.

One mechanism for two jobs. This single attention operation simultaneously (1) lets memories interact with each other and (2) lets memories absorb new input information. Each memory slot decides how much to attend to other memories vs. how much to attend to the new input — all through learned attention weights. There is no separate "write" operation.

Think about what each memory is doing during this step. It looks at all other memories and at the new input, then decides: "Do I care about this new observation? Does it change what I'm storing? Or should I focus on what the other memories are telling me?" The attention weights encode these decisions.

This is fundamentally different from how an LSTM incorporates input. An LSTM uses gates (forget, input, output) to control what enters the hidden state. Those gates are computed from the current input and the previous hidden state — a function of just two things. In the RMC, each memory slot's update is a function of every other memory and the input. The contextual awareness is richer.

An interesting special case: when M has only one row (a single memory slot), this reduces to something LSTM-like. The single memory attends to itself and the input, deciding how much new information to incorporate. But even here, multiple attention heads can provide a form of compartmentalization — each head attends to different aspects of the input, effectively creating sub-memories within the single vector.

Let us trace a concrete example. Suppose M has 3 slots and a new input x arrives. The key matrix K has 4 rows (3 memory keys + 1 input key). Each memory slot produces a query that scores against all 4 keys. If slot 1 holds "ball A position" and slot 2 holds "ball B position," and the new input x is "now compute distances," then slot 1's query might attend strongly to slot 2 (to compute the A-B distance) and to x (to register the instruction). Slot 3 might attend mostly to x to store the instruction for later. The attention distribution encodes both the information flow and the task strategy.

Why does the RMC use [M; x] for keys/values but only M for queries?

Chapter 4: LSTM Gating

Attention gives us memory interaction. But we still need recurrence — the ability to carry information across time steps. The paper embeds the attention-based update into an LSTM-style gating mechanism.

The idea: treat the memory matrix M as a matrix of LSTM cell states. Each row mi is a cell state that gets gated independently. The attention output M̃ replaces what would normally be the "candidate" in a standard LSTM.

Forget gate
fi,t = Wfxt + Ufhi,t-1 + bf
Input gate
ii,t = Wixt + Uihi,t-1 + bi
Cell update
mi,t = σ(fi,t) ⊙ mi,t-1 + σ(ii,t) ⊙ gψ(m̃i,t)
Output gate
hi,t = σ(oi,t) ⊙ tanh(mi,t)

The critical line is the cell update. In a standard LSTM, the candidate (what gets added to the cell state) comes from a simple tanh(Wx + Uh). In the RMC, it comes from gψ(m̃i,t) — the attention output processed through a row-wise MLP with layer normalization. The attention output already incorporates information from all other memory slots and the input. The gates then control how much of this relational information actually gets written into the cell state.

The key modification to a standard LSTM. Replace the candidate computation — which is just a linear function of input and previous hidden state — with attention over all memory slots plus the input. The forget gate, input gate, and output gate work exactly as in a standard LSTM. Only the source of "new information" changes.

The paper also introduces memory gating as an alternative to standard "unit gating." In unit gating, each individual element of mi gets its own gate scalar — just like a standard LSTM. In memory gating, each entire row mi gets a single scalar gate. This means the gate decides "how much should this entire memory slot forget?" rather than "how much should each element forget?" Memory gating is cheaper (fewer parameters) and sometimes works better, because it enforces a holistic decision: either keep this memory or replace it, rather than fragmenting the decision across individual dimensions.

In practice, the authors found that output gates were not necessary. The implementation in DeepMind's Sonnet library omits them. This simplification makes sense: the attention mechanism already provides a sophisticated selection mechanism for what information to expose. The gates primarily serve to control the temporal dynamics — how quickly old information is overwritten by new relational information.

The gate weights Wf, Wi, Wo are shared across all memory slots, just as the attention weights are. So adding more memory slots adds no parameters — only more storage capacity.

What replaces the standard LSTM candidate in the Relational Memory Core?

Chapter 5: The Nth Farthest Task

To test whether the RMC truly enables relational reasoning over memories, the authors designed a task that is impossible without it. The setup: the model receives 8 randomly sampled 16-dimensional vectors, one per time step, each with a random label (1-8). After all vectors arrive, the model must answer: "What is the Nth farthest vector from vector M?" where N and M are given as part of the input.

Why this task is so hard. The model must: (1) store all 8 vectors in memory, (2) recall which vector has label M, (3) compute the Euclidean distance from M to every other vector, (4) sort those distances, and (5) return the Nth largest. Steps 3-5 are pure relational reasoning — comparing memories to each other. And the model might receive vector M first, last, or somewhere in the middle of the sequence.

The results are stark:

LSTM
Below 30% accuracy. Cannot untangle vectors from its compressed hidden state to compute pairwise distances.
DNC
Below 30% accuracy. Can store vectors in separate slots, but has no mechanism for slots to compare with each other.
RMC
91% accuracy. Memory slots interact via attention, enabling distance computation and implicit sorting.

The attention analysis reveals how the RMC solves this. Before the reference vector M arrives, the model writes each input into one or two memory slots — the attention weights show high values from a few slots' queries to the input key. After M arrives, the behavior changes dramatically: all memory slots shift their attention to focus on the slots where M was written. The model seems to be comparing each stored vector against M, exactly the distance-computation step the task requires.

The hyperparameter analysis is also revealing. With 8 or 16 memory slots, the RMC solves the task robustly. With just 1 memory slot, performance depends on the number of attention heads — 8 or 16 heads help, but 1 head does not. This suggests that even a single memory slot can use multiple attention heads to compartmentalize information into sub-spaces, though not as effectively as using separate slots.

An important detail: the model receives the task specification (values of N and M) appended to every input time step, not just at the end. So the model knows from the start which vector is the reference and what rank to find. Yet LSTMs and DNCs still fail. Knowing what to compute is not enough — you also need the right mechanism to compute it. The LSTM knows it should compare vectors, but its single-vector hidden state makes the comparison intractable.

Why does the DNC also fail at the Nth Farthest task despite having a memory matrix with separate slots?

Chapter 6: Language Modeling & RL

The Nth Farthest task was a proof of concept. But does memory interaction help on real-world tasks? The paper tests the RMC on three domains: language modeling, program evaluation, and reinforcement learning.

Language modeling. The RMC achieved state-of-the-art perplexity on three benchmarks at the time of publication:

WikiText-103
29.0 perplexity (LSTM baseline: 29.2). Modest gain, but on a very competitive benchmark.
Project Gutenberg
38.5 vs. 43.7 for LSTM. Literature text benefits more from relational memory.
GigaWord
38.3 vs. 43.7 for LSTM. News articles with complex entity references.

Interestingly, the best language modeling configuration used only 1 memory slot with 2500 units and 4 attention heads. Language does not need many separate memory compartments — but it benefits from the multi-head attention allowing the single memory to relate different aspects of context to each other. The heads provide implicit compartmentalization within a single vector.

A surprising finding about context length. When the RMC was tested with shorter unroll lengths, its perplexity degraded less than the LSTM's. At 100-step unrolls, the RMC lost only 1 perplexity point while the LSTM lost 5. This suggests the RMC focuses on more recent, locally relevant context rather than relying on long-range state transfer. The attention mechanism may help the model be more efficient with the information it retains.

Program evaluation. On the Learning to Execute benchmark — programs with variables, loops, and conditionals — the RMC matched or exceeded all baselines. Programs involve symbolic manipulation: variable x gets assigned, then modified in a loop, then used in a conditional. Tracking the value of x requires relating the assignment, the loop body, and the conditional — a relational task.

Reinforcement learning. On Mini PacMan with a limited viewport (5×5 window), the agent must navigate a maze using only partial observations. It needs to remember where food was, where ghosts were last seen, and plan accordingly. The RMC agent learned effective exploration strategies, outperforming a ConvLSTM baseline.

On BoxWorld — a relational planning task where the agent must find the right key-lock sequence by navigating a grid — the RMC solved 98% of levels compared to 73% for ConvLSTM. This task is particularly telling: the key-lock pairs are scattered randomly across the grid, so spatial proximity is irrelevant. The agent must reason purely about abstract relationships between colors (which key opens which lock), making it a clean test of relational reasoning in memory. The ConvLSTM's spatial inductive bias actively hurts here.

For language modeling, the best RMC configuration used only 1 memory slot. Why did multiple slots not help?

Chapter 7: Showcase — Interactive RMC Step

Below is an interactive simulation of one time step of the Relational Memory Core. You have a memory matrix M with 4 slots. Click "Send Input" to feed a new observation. Watch how attention distributes across memory slots and the input, and how each memory updates with information from the others.

Click a memory slot to see its attention weights. Click "Send Input" to run a new step.

Ready — 4 memory slots initialized

Notice how after attention, each memory slot is a blend of its old content and information from other slots. The attention weights (shown as connection thickness) reveal which memories are interacting most strongly. The input gets incorporated through the same attention mechanism — some slots attend heavily to it while others focus on inter-memory communication.

In the RMC's attention step, what determines how much memory slot i attends to memory slot j?

Chapter 8: Why It Works

The RMC combines three ingredients that each address a different aspect of temporal relational reasoning:

Compartmentalization
The memory matrix separates information into distinct slots, preventing the lossy compression of a single LSTM vector.
+
Interaction
Multi-head attention lets slots communicate, enabling comparison, contrast, and aggregation of stored information.
+
Persistence
LSTM gating controls what is retained and what is forgotten, providing stable long-term storage across time steps.

Remove any one ingredient and the system breaks. Without compartmentalization (single vector), you get an LSTM — good at storing but bad at comparing. Without interaction (independent slots), you get a DNC or Entity Network — good at storing separately but bad at relating. Without persistence (no gates), you get something closer to a feedforward Transformer — good at relating within a window but bad at maintaining state across many steps.

This three-part decomposition also explains the hyperparameter sensitivity. Tasks with many distinct entities (Nth Farthest, BoxWorld) need many slots for compartmentalization. Tasks with rich, continuous context (language modeling) need large slots for representation capacity. The balance between slots and slot size, combined with the number of attention heads, determines what kind of relational reasoning the model can perform.

The deeper insight: relational reasoning requires structure. You cannot reason about relationships between unstructured representations. The RMC's contribution is showing that the right structural bias — separate slots plus self-attention — is sufficient to unlock relational reasoning in recurrent networks. The mechanism is borrowed from Transformers, but the application is novel: applying attention to memory, not to input sequences.

There are limitations. The attention step is O(N²) in the number of memory slots, which limits scalability. The model requires tuning the number of slots, heads, and attention blocks per task. And the paper does not fully explain why certain configurations work for certain tasks — why does language modeling prefer 1 large slot while Nth Farthest needs 8 small ones? The analysis of attention patterns is suggestive but not conclusive, since after even one round of attention the memory becomes highly distributed, making interpretation difficult.

The paper also raises an interesting question about generalization. The RMC was trained on 16-dimensional vectors and tested with the same dimensionality. When the authors tried 32-dimensional vectors, performance was less robust — fewer model configurations succeeded. This hints that the relational reasoning capacity does not trivially generalize to harder problem settings, and that the number of memory slots and attention heads may need to be tuned in tandem with task complexity.

Still, the paper's central claim is well-supported: standard memory architectures lack a relational inductive bias, and adding one (via self-attention over memory slots) produces measurable gains on tasks requiring relational reasoning over time.

What happens if you keep the memory matrix but remove the attention mechanism (making slots independent)?

Chapter 9: Connections

The Relational Memory Core sits at the intersection of three major research threads: recurrent memory, relational reasoning, and self-attention. Understanding where it came from and where it led clarifies its place in the broader landscape.

Transformers (Vaswani et al., 2017). The RMC borrows its core mechanism — multi-head dot-product attention — directly from the Transformer. The key difference is context: Transformers apply self-attention across all positions in a sequence (or all previous positions), while the RMC applies it across memory slots at a single time step. The Transformer processes the full sequence in parallel; the RMC processes it recurrently, one step at a time, using attention only for memory interaction. This paper showed that the Transformer's attention mechanism is useful beyond sequence-to-sequence modeling — it can serve as a general-purpose relational reasoning module.

Relation Networks (Santoro et al., 2017). Written by overlapping authors just one year earlier, Relation Networks introduced the idea that pairwise comparison of all entities is a powerful inductive bias for relational reasoning. The RN operates on spatial entities (objects in an image); the RMC extends this idea to temporal entities (memories stored over time). Where the RN uses an MLP on each pair with summation, the RMC uses attention — a more parameter-efficient way to compute pairwise interactions. The intellectual lineage is direct.

Memory-augmented neural networks. Neural Turing Machines (Graves et al., 2014) and DNCs (Graves et al., 2016) introduced the idea of external memory matrices with read/write heads. The RMC keeps the memory matrix but replaces the controller-driven read/write mechanism with self-attention. This is arguably simpler: instead of learning to address specific slots, the model learns which slots should interact through attention. The trade-off is that attention is all-to-all, while DNCs can be more surgical in their memory access.

Transformer-XL and recurrence in Transformers. After this paper, Transformer-XL (Dai et al., 2019) tackled the same problem from the opposite direction: adding recurrence to Transformers rather than adding attention to RNNs. Transformer-XL caches hidden states from previous segments and attends to them in subsequent segments. Both approaches recognize that you need both attention and recurrence for long-range temporal reasoning, but they arrive from different starting points.

Modern large language models. Today's LLMs (GPT, Claude, etc.) are pure Transformers that handle both storage and relational reasoning through self-attention over long context windows, without explicit recurrence. The RMC's insight — that memory slots should interact — is baked into every Transformer layer. In a sense, the Transformer won by applying the RMC's core idea (self-attention over memories) at every layer, to every token, with massive scale. The RMC remains relevant in settings where recurrence is needed: streaming data, resource-constrained environments, and tasks requiring indefinite-length processing.

The lasting contribution. This paper demonstrated that the gap between "remembering" and "reasoning about memories" can be closed by a single mechanism: self-attention over memory slots. The Transformer already had this mechanism for sequences; the RMC showed it works for recurrent memory too. Every modern architecture that maintains and reasons over internal state — from State Space Models to memory-augmented Transformers — inherits this insight.

Paper details. "Relational recurrent neural networks," Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Théophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, Timothy Lillicrap. NeurIPS 2018. arXiv:1806.01822.

← Back to Veanors Hub

What is the key difference between how the RMC and the Transformer use multi-head attention?