Relation Networks for Relational Reasoning

Chapter 0: The Problem

Look at a table with a red ball, a blue cube, and a green cylinder. Now answer: "What shape is the object closest to the blue cube?"

You solved that instantly. You identified the blue cube, computed its distance to every other object, found the closest one, and reported its shape. Four operations, done in a flash. This is relational reasoning — reasoning about how entities relate to each other, not just what they are.

Now try to build a neural network that does the same thing. A convolutional neural network can recognize that there is a red ball in the image. It can tell you the ball is red. It can even localize the ball in the image. But ask "is the red ball left of the blue cube?" and the CNN struggles. Why?

The core difficulty: standard architectures have no built-in mechanism for comparing one entity with another. A CNN learns filters that detect local patterns such as edges, textures, and shapes. But "left of" is not a local pattern. It is a relationship between two spatially separated entities, and no single convolutional filter can capture it.

Before this paper, the best approach to visual question answering on the CLEVR dataset — a benchmark specifically designed to test relational reasoning — scored 68.5%. Humans scored 92.6%. The gap was enormous, and it was entirely due to relational questions. On non-relational questions ("what color is the big sphere?"), models did reasonably well. On relational questions ("is the cube the same material as the cylinder?"), the best models performed barely above a baseline that just memorized answer frequencies. Researchers had thrown ResNets, VGGs, attention mechanisms, and huge fully connected layers at the problem. Nothing worked.

Santoro et al. proposed a module so simple it seems like it should not work: take every possible pair of objects, pass each pair through a small neural network, and sum the results. They called it a Relation Network (RN). It scored 95.5% on CLEVR — surpassing human performance.

The lesson of this paper is that architecture matters. Not more parameters, not more data, not a cleverer training trick. Just the right structural bias: consider all pairs. This is the same lesson that convolutions taught for spatial processing, and that recurrence taught for sequences. Each architectural innovation succeeds because it bakes in the right inductive bias for the problem at hand. For relational reasoning, the inductive bias is: compare everything to everything.

And the module is truly plug-and-play. The paper demonstrates it on three completely different domains — visual question answering, text-based question answering, and physical system reasoning — with the same core architecture. The only thing that changes is how "objects" are defined.

Why do standard CNNs struggle with relational questions like "is the red ball left of the blue cube?"

Because "left of" is a relationship between two spatially separated entities, and convolutional filters only detect local patterns
Because CNNs cannot process color information
Because CNNs require too much training data for relational tasks

Chapter 1: What Is a Relation?

Before we write any equations, let us build intuition for what "relational reasoning" actually means. Consider three types of questions you could ask about a scene:

Recognition

"What color is the large sphere?" — requires identifying one object and reading its attribute.

↓

Comparison

"Is the cube the same material as the cylinder?" — requires identifying two objects and comparing one attribute.

↓

Counting

"How many objects are the same shape as the green thing?" — requires comparing the green thing to every other object.

The first question is non-relational. You only need to look at one object. The second and third are relational. You must consider pairs of objects and reason about how they relate.

Here is the key insight: a relation is a property that emerges from considering two or more entities together. "Red" is a property of a single object. "Left of" is a property of a pair. "Between" is a property of a triple. You cannot determine "left of" by looking at either object alone — you need the positions of both.

Psychologists have studied relational reasoning for decades. It is considered a hallmark of human intelligence. Children develop relational thinking around age 3-4, and it is central to analogy, planning, and abstract thought. Standardized intelligence tests (like Raven's Progressive Matrices) are fundamentally tests of relational reasoning.

Yet before this paper, neural networks — even deep, powerful ones — had no built-in mechanism for relational reasoning. They had to learn it implicitly through massive parameter counts and lots of data. Usually, they failed. A CNN with millions of parameters, trained on millions of images, could not answer "is the red ball left of the blue cube?" The capacity was not the problem. The structure was.

Relations are everywhere. "Sandra went to the office" + "Sandra picked up the football" → "Where is the football?" requires relating Sandra's location to the football's location. Physics: two balls connected by an invisible spring — you infer the spring by observing how their relative motion is constrained. Relational reasoning is not just vision. It is a fundamental cognitive ability.

The paper frames the problem cleanly: given a set of objects (whatever those may be — image regions, sentence encodings, physical entities), learn a function that reasons about their pairwise relations and aggregates those relations into an answer.

Notice two critical design choices in that framing. First, pairwise: the paper restricts attention to relations between pairs of objects, not triples or higher-order combinations. This is a simplification, but it turns out to be surprisingly powerful. Second, all pairs: the network does not try to guess which pairs matter. It considers every possible pair and lets the learning process figure out which relations are relevant.

Why not triples or higher? Pragmatism. For n objects, there are n² pairs but n³ triples. With even modest n, computing all triples becomes prohibitive — 5 objects yield 25 pairs but 125 triples; 10 objects yield 100 pairs but 1,000 triples.
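The counts above can be checked by brute-force enumeration. This is a minimal sketch; the helper name `count_tuples` is made up for illustration and counts ordered tuples with repeats allowed, which is the convention the n² and n³ figures assume.

```python
from itertools import product

def count_tuples(n: int, order: int) -> int:
    """Count ordered tuples of `order` objects drawn from n objects, repeats allowed."""
    return sum(1 for _ in product(range(n), repeat=order))

for n in (5, 10):
    # order=2 gives the pairs an RN evaluates; order=3 gives the triples it avoids
    print(n, count_tuples(n, 2), count_tuples(n, 3))
```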

And empirically, pairwise comparisons handle a surprising range of relational questions. "Left of" is inherently pairwise. "Between" involves three objects, but you can approximate it by combining two pairwise relations: "A is right of B" and "A is left of C" together imply "A is between B and C." The sum in the RN aggregates these pairwise signals, enabling implicit higher-order reasoning.

Which of these questions is relational?

"What color is the large sphere?"
"Are there any objects that have the same shape as the blue cylinder?"
"Is there a red object in the scene?"

Chapter 2: The RN Equation

The entire Relation Network can be written in one line:

RN(O) = f_φ( Σ_i,j g_θ(o_i, o_j) )

That is it. One equation. No attention heads, no memory banks, no program executors. Let us unpack every symbol.

O = {o₁, o₂, ..., o_n} is a set of objects. Each object o_i is a vector — a list of numbers describing that object. For a CLEVR scene with 5 objects, O has 5 vectors.

g_θ is a small MLP (multi-layer perceptron). It takes a pair of objects (o_i, o_j) — concatenated into one long vector — and outputs a vector that represents "how these two objects relate." The paper calls this output a relation. The parameters θ are learned during training.

Σ_i,j sums over all ordered pairs. For n objects, that is n² pairs (including self-pairs like (o₁, o₁)). The sum is the aggregation step — it collects all pairwise relations into a single vector.

f_φ is another MLP that takes the aggregated relation vector and produces the final output (e.g., an answer to a question). Parameters φ are also learned.
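To make the symbols concrete, here is a minimal NumPy sketch of the forward pass. It is an illustration under stated assumptions, not the paper's implementation: the layer sizes, the random initialization, and the helper names `init_mlp`, `mlp`, and `relation_network` are all chosen here for brevity, and nothing is trained.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random (weight, bias) pairs for an MLP with the given layer sizes."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """ReLU MLP; the last layer is left linear."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def relation_network(objects, g_params, f_params):
    """RN(O) = f( sum over i,j of g(o_i, o_j) ), all ordered pairs, self-pairs included."""
    n, k = objects.shape
    left = np.repeat(objects, n, axis=0)            # o_i for each pair, shape (n*n, k)
    right = np.tile(objects, (n, 1))                # o_j for each pair, shape (n*n, k)
    pairs = np.concatenate([left, right], axis=1)   # row i*n + j holds [o_i; o_j]
    relations = mlp(g_params, pairs)                # one relation vector per pair
    return mlp(f_params, relations.sum(axis=0))     # aggregate, then map to the output

rng = np.random.default_rng(0)
objects = rng.normal(size=(5, 7))                   # 5 objects, 7 features each (toy sizes)
g_params = init_mlp(rng, [14, 32, 32])              # g takes a concatenated pair: 2 * 7 = 14
f_params = init_mlp(rng, [32, 32, 4])               # f maps the summed relations to 4 outputs
out = relation_network(objects, g_params, f_params)
print(out.shape)  # (4,)
```

The pair matrix is built once and pushed through g in a single call, so the double sum never appears as an explicit Python loop.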

Why this works: three structural guarantees.

1. All pairs are considered. The network cannot miss a relevant relationship because it evaluates every possible pair. It does not need to learn which pairs to look at.

2. One shared function. The same g_θ processes every pair. This means it learns a general "how do two objects relate?" function rather than memorizing pair-specific patterns. An MLP reading the n objects as one flat vector would have to encode a separate relation function for each of the n² pair positions in its weights. The RN learns just one.

3. Order invariance. The sum is commutative. Shuffling the objects in O produces the same output. This is correct: the answer to "is the red ball left of the blue cube?" should not change if we relabel the objects.

Let us trace through a concrete example. Suppose we have 3 objects:

Object	Features
o₁ (red ball)	[1, 0, 0, 1, 0, 2.1, 3.4]
o₂ (blue cube)	[0, 0, 1, 0, 1, 5.7, 1.2]
o₃ (green cylinder)	[0, 1, 0, 0, 0, 4.0, 6.1]

The RN forms all 9 pairs: (o₁,o₁), (o₁,o₂), (o₁,o₃), (o₂,o₁), (o₂,o₂), (o₂,o₃), (o₃,o₁), (o₃,o₂), (o₃,o₃). Each pair is concatenated and fed through g_θ. The 9 output vectors are summed element-wise. The sum goes into f_φ, which produces the answer.
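The pair-forming step of this trace can be written out directly. The sketch below only builds the concatenated inputs that g_θ would receive; the feature values are the ones from the table above, and the dictionary keys are labels chosen here for readability.

```python
from itertools import product

objects = {
    "o1 (red ball)":       [1, 0, 0, 1, 0, 2.1, 3.4],
    "o2 (blue cube)":      [0, 0, 1, 0, 1, 5.7, 1.2],
    "o3 (green cylinder)": [0, 1, 0, 0, 0, 4.0, 6.1],
}

# Every ordered pair, self-pairs included, with the two feature lists concatenated.
pairs = [(a, b, objects[a] + objects[b]) for a, b in product(objects, repeat=2)]

print(len(pairs))        # 9
print(len(pairs[0][2]))  # 14
for a, b, _ in pairs:
    print(a, "->", b)
```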

For n objects, we compute g_θ exactly n² times. With 5 CLEVR objects, that is 25 calls. With the CNN-based pixel pipeline, where the "objects" are the d² cells of a d×d feature map, n = d² and the number of calls grows to d⁴. That is much larger, but still tractable because g_θ is small.

In graph theory terms, this is equivalent to operating on a complete directed graph: every object is a node, and there is an edge from every node to every other node (including self-loops). The function g_θ is the edge function — it computes a vector for each edge. The sum aggregates all edge vectors into a graph-level representation, and f_φ maps that to the output.

Could we use fewer pairs? Yes — if we know which objects are likely to interact, we could provide only those pairs. The paper acknowledges this: "this RN definition can be adjusted to consider only some object pairs." For instance, in a physics simulation you might know that only nearby objects interact, and skip distant pairs. But the all-pairs approach has a decisive advantage: it requires zero prior knowledge about the structure of the problem. The network discovers which relations exist through training. This is what makes it a general-purpose module.
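One way to picture the "only some pairs" variant is a distance mask. The sketch below is hypothetical: the radius rule, the function name `masked_relation_sum`, and the toy stand-in for g_θ are assumptions made for illustration, not something the paper specifies.

```python
import numpy as np

def masked_relation_sum(objects, positions, g, radius):
    """Sum g(o_i, o_j) over ordered pairs whose positions lie within `radius`."""
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    keep = dist <= radius                      # (n, n) boolean mask, self-pairs kept
    total = sum(g(objects[i], objects[j]) for i, j in zip(*np.nonzero(keep)))
    return total, int(keep.sum())

# Toy stand-in for g_θ (assumption): any function of the pair would do here.
def toy_g(a, b):
    return np.maximum(a - b, 0.0)

positions = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
objects = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
total, n_pairs = masked_relation_sum(objects, positions, toy_g, radius=2.0)
print(n_pairs)  # 5 of the 9 ordered pairs survive the mask
```

The third object is far from the other two, so only its self-pair is kept. The price of this saving is exactly the prior knowledge the all-pairs version avoids.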

One subtlety: the sum includes self-pairs (o_i, o_i). At first glance this seems wasteful: what can an object learn from comparing itself to itself? But self-pairs allow g_θ to extract per-object features (akin to a unary function), which can be useful. The model can also learn to produce near-zero output for self-pairs when they are uninformative, so including them does little harm.

If a scene has 4 objects, how many times is g_θ evaluated?

4 (once per object)
6 (unordered pairs of distinct objects)
16 (all ordered pairs, including self-pairs: 4²)

Chapter 3: From Pixels to Objects

The RN equation assumes we already have a set of objects. But images are not sets of objects — they are grids of pixels. How do we bridge the gap?

The paper's solution is elegant and perhaps surprising in its simplicity: run a standard CNN over the image, and treat every spatial location in the final feature map as an "object." No region proposals. No bounding boxes. No segmentation masks. Just grid cells.

Step 1: Convolve

A 128×128 image passes through 4 convolutional layers with 24 filters each. Output: a d×d grid of 24-dimensional feature vectors.

↓

Step 2: Tag with coordinates

Each cell in the d×d grid gets two extra features appended: its (x, y) position in the grid. This gives the RN spatial information.

↓

Step 3: Treat as objects

Each of the d² cells becomes an "object" — a 26-dimensional vector (24 CNN features + 2 coordinates). These go into the RN.
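Steps 2 and 3 amount to a few array operations. In this sketch the feature map is a placeholder array rather than real CNN output, and scaling the coordinates to [-1, 1] is an assumption made here; the text above only says that an (x, y) position is appended.

```python
import numpy as np

def feature_map_to_objects(feature_map):
    """(d, d, k) feature map -> (d*d, k+2) objects with (x, y) tags appended."""
    d, _, k = feature_map.shape
    coords = np.linspace(-1.0, 1.0, d)                    # assumed coordinate scaling
    ys, xs = np.meshgrid(coords, coords, indexing="ij")   # ys varies by row, xs by column
    tagged = np.concatenate([feature_map, xs[..., None], ys[..., None]], axis=-1)
    return tagged.reshape(d * d, k + 2)

fmap = np.zeros((8, 8, 24))   # placeholder for the CNN output, d = 8, k = 24
objs = feature_map_to_objects(fmap)
print(objs.shape)  # (64, 26)
```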

This is deliberately agnostic about what constitutes an "object." A cell might represent background, part of a sphere, the edge of a cube, or a conjunction of multiple things. It might even represent the boundary between two objects. The network does not need to segment the scene into semantic objects first. It lets the CNN learn whatever representations are useful, and the RN learns to reason about their relationships.

This is counterintuitive. How can you reason about "objects" when most of your "objects" are just patches of background? The answer: g_θ learns to produce near-zero output for uninformative pairs. A (background, background) pair contributes nothing to the sum. A (red ball region, blue cube region) pair contributes a lot. The model learns this automatically through training.

Objects do not need to be objects. This is one of the paper's most important insights. The word "object" in the RN equation is purely formal — it means "an element of the input set." For images, objects are feature map cells. For text, objects are sentence embeddings. For physics, objects are state vectors. The RN does not care. It just needs a set of things to compare pairwise.

There is a cost: with a d×d feature map, we have d² "objects," and the RN evaluates d⁴ pairs. The paper uses relatively small feature maps (d≈8), giving about 64 objects and 4,096 pairs — expensive but feasible. Each g_θ call is a small 4-layer MLP with 256 units per layer, so 4,096 forward passes through a small network is manageable on a GPU. The trick is that these passes can be batched: stack all 4,096 concatenated pair vectors into one matrix and run a single batched MLP forward pass.
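The batching trick can be sketched with NumPy broadcasting. The object matrix below is random placeholder data, and the single weight matrix stands in for the first layer of g_θ only; both are assumptions for illustration.

```python
import numpy as np

def all_pairs(objects):
    """Stack every ordered pair [o_i; o_j] into one (n*n, 2k) matrix."""
    n, k = objects.shape
    left = np.broadcast_to(objects[:, None, :], (n, n, k))    # left[i, j] = o_i
    right = np.broadcast_to(objects[None, :, :], (n, n, k))   # right[i, j] = o_j
    return np.concatenate([left, right], axis=-1).reshape(n * n, 2 * k)

rng = np.random.default_rng(0)
objects = rng.normal(size=(64, 26))          # d = 8 gives 64 objects of 26 features
pairs = all_pairs(objects)
print(pairs.shape)  # (4096, 52)

# One matrix multiply covers all 4,096 pairs for this layer (toy first layer of g_θ).
W = rng.normal(scale=0.1, size=(52, 256))
hidden = np.maximum(pairs @ W, 0.0)
print(hidden.shape)  # (4096, 256)
```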

Notice how different this is from traditional object detection pipelines. Systems like Faster R-CNN first detect objects, draw bounding boxes, extract features, and then reason about them. The RN skips all of that. It does not know what an "object" is. It just takes every spatial cell and compares everything to everything. The CNN and end-to-end training together figure out what useful representations to put in those cells.

This "let the network figure out what objects are" philosophy was radical in 2017. Most visual reasoning pipelines at the time used a two-stage approach: first detect and segment objects, then reason about them. The RN showed that an end-to-end approach — where object representations emerge from the learning signal — could outperform hand-engineered pipelines. The implicit object representations learned by the CNN may not correspond to neat bounding boxes, but they contain the information the RN needs.

The paper also tests a second mode: state descriptions, where each object is explicitly described as a feature vector (3D coordinates, color, shape, material, size). This bypasses the CNN entirely and feeds objects directly to the RN. It achieves even higher accuracy (96.4%), confirming that the RN itself is the key ingredient.

Why does the paper append (x, y) coordinate tags to CNN feature map cells?

Because the RN needs spatial information to reason about relations like "left of" or "closest to"; without coordinates, all positions look the same
Because coordinates increase the dimensionality for better classification
Because the CNN loses all position information during convolution

Chapter 4: Conditioning on Questions

The RN equation so far computes relations between objects. But in visual question answering, the meaning of a relation depends on the question. "Is the red ball left of the blue cube?" asks about spatial arrangement. "Is the red ball the same material as the blue cube?" asks about material properties. The RN must know which question is being asked.

The solution is to condition g_θ on the question. The modified equation becomes:

a = f_φ( Σ_i,j g_θ(o_i, o_j, q) )

where q is a question embedding — a vector that encodes the question's meaning. To produce q, the question's words are fed one at a time into an LSTM, and the final hidden state is used as the question embedding. Each word is first converted to a learned embedding vector via a lookup table.

Concretely, g_θ now receives the concatenation [o_i; o_j; q] as input. This means every pair evaluation is informed by what the question is asking. If the question is about color, g_θ can learn to focus on color features. If the question is about spatial position, g_θ can focus on coordinates.
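In array terms, conditioning just widens every pair vector by the length of q. The sketch below uses zero-filled placeholders for the objects and a random vector for q; the sizes (26-dimensional objects, 128-dimensional question embedding) follow the figures quoted in this article, and the function name is made up here.

```python
import numpy as np

def conditioned_pairs(objects, q):
    """Build [o_i; o_j; q] for every ordered pair, one row per pair."""
    n, _ = objects.shape
    left = np.repeat(objects, n, axis=0)      # o_i
    right = np.tile(objects, (n, 1))          # o_j
    q_tiled = np.tile(q, (n * n, 1))          # the same question copied onto every pair
    return np.concatenate([left, right, q_tiled], axis=1)

objects = np.zeros((64, 26))                  # placeholder objects
q = np.random.default_rng(0).normal(size=128) # placeholder question embedding
g_inputs = conditioned_pairs(objects, q)
print(g_inputs.shape)  # (4096, 180)
```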

Why concatenation rather than something more sophisticated, like bilinear pooling or cross-attention? Because g_θ is a universal function approximator (it is an MLP). Given the concatenated input, it can learn any function of the triple (o_i, o_j, q). A more complex combination scheme would add engineering complexity without increasing expressiveness. Simplicity wins.

Question-dependent attention, for free. By concatenating the question to every pair, the RN can learn something that behaves like question-dependent attention. For "what color is the sphere left of the cube?", g_θ can learn to produce large outputs only for (sphere, cube) pairs where the sphere is indeed to the left, and near-zero contributions for all other pairs. The RN has no explicit attention mechanism (g_θ is just an MLP), but the effect is similar.

The full pipeline for visual QA:

Image → CNN

128×128 image → 4 conv layers → d×d feature map → d² objects (with coordinate tags)

↓

Question → LSTM

Question words → word embeddings → LSTM → final hidden state q

↓

For each pair (o_i, o_j): concatenate [o_i; o_j; q], feed through g_θ. Sum all outputs. Feed through f_φ. Softmax over answer vocabulary.

The entire system — CNN, LSTM, and RN — is trained end-to-end with cross-entropy loss. This means the CNN learns to produce feature maps that are useful for relational reasoning, not just object recognition. The LSTM learns embeddings that tell the RN what to look for. Everything adapts together.

The model configuration is remarkably modest compared to the visual QA architectures of the day:

Component	Configuration
CNN	4 conv layers, 24 kernels each, ReLU + batch norm
Question LSTM	128 units, 32-dim word embeddings
g_θ (relation)	4-layer MLP, 256 units per layer, ReLU
f_φ (output)	3-layer MLP: 256, 256 (50% dropout), 29 units
Output	Softmax over 29-word answer vocabulary
Optimizer	Adam, lr = 2.5 × 10^-4

No ResNet-101. No VGG. No stacked attention modules. No 4,000-unit fully connected layers. The simplicity of the architecture is the point: the improvement comes from the structure of the computation (consider all pairs), not from scale.
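A rough parameter count for the two MLPs shows how small the reasoning module is. This is a back-of-the-envelope sketch under assumptions: g_θ's input is taken to be two 26-dimensional objects plus a 128-dimensional question embedding, each layer has a bias, and the CNN, LSTM, and batch-norm parameters are left out.

```python
def mlp_param_count(sizes):
    """Weights plus biases for a fully connected stack with the given layer widths."""
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

g_in = 26 + 26 + 128                                          # [o_i; o_j; q], assumed widths
g_count = mlp_param_count([g_in, 256, 256, 256, 256])         # 4 layers of 256 units
f_count = mlp_param_count([256, 256, 256, 29])                # 256, 256, 29 units
print(g_count, f_count)  # 243712 139037
```

Under these assumptions the relational core comes to fewer than 400,000 parameters.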

How does the RN know which question is being asked?

Different g_θ networks are used for different question types
The question embedding q is concatenated to every object pair before passing through g_θ
An attention mechanism selects relevant objects before the RN processes them

Chapter 5: CLEVR & Sort-of-CLEVR

The paper's primary battleground is CLEVR — a dataset of rendered 3D scenes with objects of different shapes, sizes, colors, and materials. Each scene comes with questions that test different reasoning abilities.

Question Type	Example	Relational?
Query attribute	"What color is the large sphere?"	No
Exist	"Is there a small red cube?"	Sometimes
Count	"How many objects are the same shape as the blue thing?"	Yes
Compare attribute	"Is the cube the same material as the cylinder?"	Yes
Compare numbers	"Are there more cubes than cylinders?"	Yes

The results are striking:

Model	Overall	Count	Compare Attr.
Human	92.6%	86.7%	96.0%
CNN+LSTM+SA (best prior)	68.5%	52.2%	52.3%
CNN+LSTM+RN	95.5%	90.1%	97.1%

The RN does not just improve overall accuracy — it demolishes the gap on relational question types. On "compare attribute" questions, the previous best (stacked attention) scored 52.3%, barely above the baseline that guesses based on question type alone. The RN scores 97.1%. That is a 45 percentage point jump on the hardest category.

Look at the pattern across question types. On "query attribute" (non-relational; not shown in the table above), the stacked attention model already scores 85.3%, which is decent because the question only requires recognizing a single object. On "compare attribute" and "count" (relational), it collapses to ~52%. The RN scores above 90% on every category. The diagnosis is clear: the bottleneck was never vision or language processing. It was always relational reasoning.

Even more striking: the RN outperforms humans, 95.5% vs. 92.6%. This does not mean the RN is smarter than humans. A plausible explanation is that humans make careless errors on tedious counting questions ("how many cubes are left of the red sphere?"), while the RN systematically evaluates every pair. The machine's advantage is exhaustiveness, not intelligence.

Sort-of-CLEVR: the controlled experiment. To prove that the improvement comes specifically from the RN module and not from other differences, the authors created Sort-of-CLEVR — a simpler dataset with 2D colored shapes. They tested CNN+MLP (no RN) vs. CNN+RN on both relational and non-relational questions. Result: both models solved non-relational questions easily (~94%). But on relational questions, CNN+MLP scored ~63% while CNN+RN scored ~94%. The RN is the difference.

The authors also tested with state descriptions instead of pixels — feeding object features (position, color, shape, material, size) directly into the RN without a CNN. This scored 96.4% overall, even higher than the pixel version. The message: when the bottleneck is relational reasoning, giving the model cleaner object representations helps. But even with messy CNN features, the RN still surpasses humans.

One more detail worth noting: the paper compared against a concurrent approach that used ground-truth functional programs as additional supervision (essentially telling the model the reasoning steps). That approach, with its privileged training signal, reached 96.9%. The RN reached 95.5% with no program supervision, using only images, questions, and answers. This is a strong result: the right architecture comes within 1.4 percentage points of explicit program supervision.

What did the Sort-of-CLEVR experiment prove?

That the RN module specifically enables relational reasoning: CNN+MLP matches RN on non-relational questions but fails dramatically on relational ones
That larger models always perform better on visual QA
That 2D shapes are easier to reason about than 3D shapes

Chapter 6: Text QA & Physics

The beauty of the RN is its generality. It does not care what the "objects" are. The same equation works for images, text, and physical systems. The paper demonstrates this across two additional domains.

bAbI: Text-Based QA

The bAbI suite contains 20 text reasoning tasks. Each task provides a set of supporting sentences and a question. For example:

"Mary went to the bathroom. John moved to the hallway. Mary traveled to the office."
Question: "Where is Mary?" → Answer: "office"

For text, the paper defines "objects" as follows: each sentence is processed by an LSTM, and the LSTM's final hidden state becomes one object. The sentences are tagged with their position in the support set (1st, 2nd, 3rd, ...) so the RN has temporal ordering information.

A separate LSTM encodes the question into an embedding q. Then the standard RN processes all sentence pairs, conditioned on q.
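The sentence-as-object construction can be sketched as follows. Two things here are stand-ins and should be read as assumptions: the encoder is a mean of random word embeddings rather than an LSTM (it ignores word order and is used only to keep the example short), and the position tag is a single number appended to each encoding.

```python
import numpy as np

sentences = [
    "Mary went to the bathroom.",
    "John moved to the hallway.",
    "Mary traveled to the office.",
]

def tokenize(sentence):
    return sentence.lower().rstrip(".").split()

vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in tokenize(s)}))}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 32))   # untrained lookup table, 32-dim

def encode_sentence(sentence):
    """Stand-in encoder (assumption): mean word embedding, NOT an LSTM final state."""
    return embeddings[[vocab[w] for w in tokenize(sentence)]].mean(axis=0)

def support_set_to_objects(sentences):
    """One object per sentence: [encoding; position tag]."""
    tags = np.arange(1, len(sentences) + 1, dtype=float)[:, None]   # 1st, 2nd, 3rd, ...
    encodings = np.stack([encode_sentence(s) for s in sentences])
    return np.concatenate([encodings, tags], axis=1)

objects = support_set_to_objects(sentences)
print(objects.shape)  # (3, 33)
```

From here the three sentence objects would be paired and conditioned on q exactly as the image cells were.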

The RN-augmented model solved 18 out of 20 bAbI tasks (using the 95% accuracy threshold), matching the Differentiable Neural Computer (DNC) under joint training with 10K examples per task. It missed the threshold on two tasks, "two supporting facts" and "three supporting facts", where the answer depends on combining several sentences from the story. It did pass "basic induction", a task that several memory-augmented baselines failed. The two failures share a trait: both involve chaining several facts together, which is hard to do in a single pairwise pass.

Note the minimal prior knowledge: the paper delineates objects as sentences. Previous bAbI models processed all words from all sentences in one long sequence. The sentence-as-object choice is a form of prior knowledge, but a very mild one — periods already delineate sentences, so the sequential models could in principle learn the same decomposition.

Dynamic Physical Systems

The third domain is the most visually striking. Ten colored balls move on a table. Some pairs are connected by invisible springs or rigid constraints. The task: infer which balls are connected, or count the number of independent systems.

Here, each object is a ball described by its color (RGB) and spatial coordinates (x, y) across 16 time steps. The RN considers all pairs of balls and must learn to detect correlated motion — if two balls maintain a consistent distance over time, they are probably connected.

The same module, three domains. Pixels with questions (CLEVR), sentences with questions (bAbI), and physical trajectories (mass-spring systems). The RN equation is identical in all three cases. Only the "object" definition changes. This universality is the paper's strongest argument that relational reasoning is a separable, modular capacity that can be plugged into any architecture.

For the physics tasks, the RN correctly classified all connections in 93% of test scenes on connection inference, and reported the correct number of systems in 95% of test scenes on the counting task. The failures were interpretable: the model struggled when balls happened to move in similar directions without being connected (coincidental correlation).

The physics task is particularly interesting because it requires reasoning about temporal relations. Two balls maintaining a constant distance over 16 frames is evidence of a spring. Two balls accelerating toward each other is evidence of attraction. The RN handles this by treating the temporal trajectory as part of each object's feature vector — each ball's (x, y) coordinates across all 16 frames are concatenated into one long vector. Then g_θ compares two such trajectories to detect correlated motion.
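The object layout and the "constant distance" cue can be illustrated on synthetic data. The trajectories below are random walks invented for this sketch, not output of the paper's simulator, and `distance_spread` is a hand-written statistic standing in for the kind of signal g_θ would have to learn from a pair of objects.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 16  # time steps per trajectory

# Synthetic trajectories (assumption: random walks, not the paper's simulator).
traj_a = np.cumsum(rng.normal(size=(T, 2)), axis=0)
traj_b = traj_a + np.array([1.0, 0.0])                 # rigidly attached to A at a fixed offset
traj_c = np.cumsum(rng.normal(size=(T, 2)), axis=0)    # moves independently

def ball_object(rgb, traj):
    """Object vector: RGB color followed by (x, y) at every time step."""
    return np.concatenate([np.asarray(rgb, dtype=float), traj.reshape(-1)])

def distance_spread(t1, t2):
    """Standard deviation of the inter-ball distance over time."""
    return np.linalg.norm(t1 - t2, axis=1).std()

obj_a = ball_object([1, 0, 0], traj_a)
print(obj_a.shape)  # (35,)
print(distance_spread(traj_a, traj_b), distance_spread(traj_a, traj_c))
```

The spread is essentially zero for the rigidly linked pair and clearly positive for the independent pair, which is the contrast a pairwise function can pick up.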

This is a beautiful example of the RN's flexibility. For CLEVR, objects are CNN feature vectors. For bAbI, objects are LSTM hidden states. For physics, objects are spatiotemporal trajectories. The RN does not care. Its job is the same in every case: compare pairs and aggregate.

The counting task (how many independent systems?) is especially hard because it requires global reasoning. The RN must first determine all connections, then implicitly count connected components. A single pairwise pass can detect individual connections but counting components requires aggregating that information coherently. The RN manages this through the summation and f_φ, but it is near the limits of what a single-pass pairwise architecture can do.

In the bAbI text QA setup, what counts as an "object" for the Relation Network?

Each individual word in the support set
Each character in the text
Each sentence, encoded as the final hidden state of an LSTM that processed its words

Chapter 7: Showcase — Relational QA

Now let us see a Relation Network in action. Below is a simple scene with colored shapes. Select a question, and watch the RN compare every pair of objects to find the answer.

Click a question below, then watch how g_θ evaluates all pairs. Pairs with strong responses are highlighted — the RN has learned which pair matters for this question.

"What color is the object left of the blue circle?"

"What shape is the object closest to the green square?"

"How many objects are the same shape as the red circle?"

"Is there an object right of the yellow triangle?"

What you are seeing. The RN evaluates g_θ on every ordered pair. Most pairs produce near-zero output (gray lines). But the pair(s) relevant to the question produce strong output (bright lines). The sum of all g_θ outputs goes into f_φ, which produces the answer. The network learned which pairs matter without being told — it figured it out from training data alone.

The grid at the bottom shows the output magnitude of g_θ(o_i, o_j, q) for every pair. Rows are o_i, columns are o_j. The diagonal (self-pairs) is grayed out. Bright cells indicate strong relational signals — these are the pairs that contribute most to the final answer.
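A grid like this is straightforward to compute once g_θ exists. The sketch below uses an untrained, random one-layer stand-in for g_θ and random objects, so the values carry no meaning; it only shows how the n×n matrix of output magnitudes is assembled. All names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, q_dim, h = 4, 6, 8, 16
objects = rng.normal(size=(n, k))
q = rng.normal(size=q_dim)
W = rng.normal(scale=0.3, size=(2 * k + q_dim, h))   # untrained toy g_θ (assumption)

def toy_g(x):
    return np.maximum(x @ W, 0.0)

def response_grid(objects, q, g):
    """grid[i, j] = L2 norm of g([o_i; o_j; q]); rows are o_i, columns are o_j."""
    n = len(objects)
    grid = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            grid[i, j] = np.linalg.norm(g(np.concatenate([objects[i], objects[j], q])))
    return grid

grid = response_grid(objects, q, toy_g)
print(grid.shape)  # (4, 4)
```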

Try each question and notice how different questions activate different pairs. Q1 ("left of the blue circle") lights up the (Red, Blue) pair because the red circle is the one spatially left of blue. Q2 ("closest to the green square") lights up (Blue, Green) because the blue circle is nearest. Q3 ("same shape as the red circle") lights up all pairs involving red, because the RN must compare red to every other object. Q4 ("right of the yellow triangle") lights up all pairs involving yellow, checking each object's position relative to yellow.

This is the RN's implicit attention mechanism. Without any explicit attention module, the MLP g_θ has learned to produce near-zero outputs for irrelevant pairs. The question embedding q acts as a steering signal: it tells g_θ what kind of relation to look for, and g_θ suppresses everything else.

In the showcase, why do most object pairs produce near-zero g_θ output?

Because g_θ has learned to suppress irrelevant pairs: only the pair(s) whose relation answers the question produce strong output
Because most objects are too far apart to interact
Because g_θ only processes adjacent objects

Chapter 8: Why It Works

The Relation Network's power comes from three structural properties that align perfectly with the nature of relational reasoning.

1. Combinatorial Coverage

By evaluating all n² pairs, the RN guarantees that no relevant relationship is missed. An MLP processing all objects as a flat vector must learn to decompose the input into pairs internally — a much harder task. The RN gets this decomposition for free from its architecture.

Think of it like this: an MLP is given a bag of puzzle pieces and must learn to compare them. It must learn how to pick up two pieces, which two to pick up, and what to check about them — all within its weight matrix. The RN is given a systematic layout of every possible pair comparison and just has to learn what to look for in each comparison.

The paper demonstrates this empirically with Sort-of-CLEVR. A CNN+MLP with enough parameters to theoretically represent the relation function still fails at ~63% on relational questions. The CNN+RN with fewer total parameters succeeds at ~94%. The architecture, not the capacity, makes the difference.

2. Weight Sharing

A single g_θ processes all pairs. This is analogous to how a CNN shares a single filter across all spatial locations. The filter learns a general "what is at this location?" function; g_θ learns a general "how do these two objects relate?" function.

Weight sharing provides two benefits. First, it reduces the number of parameters. Without sharing, we would need n² separate MLPs (one per pair position), which is both wasteful and impossible when n varies. Second, it forces generalization. Because g_θ must work for any pair, it cannot memorize pair-specific idiosyncrasies. It must learn genuinely general relational features like "these two have similar color" or "this one is to the left of that one."

3. Set Input / Order Invariance

The summation in the RN equation is commutative. This means the output is invariant to the ordering of objects in O. If you shuffle the objects, the same pairs are evaluated (in a different order), the same values are produced, and the same sum results. This is the correct inductive bias: the answer to a visual question should not depend on the arbitrary order in which we list the objects.

This is not just a mathematical nicety — it is essential for correctness. If you extracted objects from an image using a CNN, the order depends on how you scan the feature map (left-to-right, top-to-bottom). A different scanning order should not change the answer. The summation guarantees this.

Compare this with an MLP that takes all objects concatenated into one flat vector. Swapping two objects in the input changes the flat vector, and the MLP may produce a completely different output. The MLP would need to learn separately that [o₁; o₂] and [o₂; o₁] should give the same answer — using up capacity on a trivial symmetry.
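The contrast can be checked numerically. This sketch uses random, untrained one-layer stand-ins for both models (an assumption for brevity): a pairwise sum in the style of the RN, and a layer applied to the flattened object list.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, h = 4, 3, 8
objects = rng.normal(size=(n, k))
W_g = rng.normal(scale=0.1, size=(2 * k, h))      # toy one-layer g_θ
W_flat = rng.normal(scale=0.1, size=(n * k, h))   # toy layer on the flattened set

def rn_sum(objs):
    """Sum of g over all ordered pairs."""
    left = np.repeat(objs, len(objs), axis=0)
    right = np.tile(objs, (len(objs), 1))
    return np.maximum(np.concatenate([left, right], axis=1) @ W_g, 0.0).sum(axis=0)

def flat_mlp(objs):
    """One layer applied to [o_1; o_2; ...; o_n] as a single flat vector."""
    return np.tanh(objs.reshape(-1) @ W_flat)

shuffled = objects[[1, 0, 2, 3]]                  # swap the first two objects

print(np.allclose(rn_sum(objects), rn_sum(shuffled)))      # True
print(np.allclose(flat_mlp(objects), flat_mlp(shuffled)))  # False
```

The pairwise sum sees the same multiset of pairs either way, so only floating-point rounding could differ. The flat layer sees a different input vector and has no reason to agree.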

Comparison with attention. Stacked attention (the previous CLEVR champion) also conditions on the question, but it works differently: it scores each image region against the question, turns the scores into scalar weights, and uses those weights to pool region features. The RN instead applies a full MLP to every pair of regions, which can compute any function of the pair, not just a scalar relevance score. This extra expressiveness is what allows the RN to handle "left of," "same material as," and "closest to" with the same function.

An analogy. Think of an MLP as a student who must answer a question about a group photo by looking at the entire photo at once and somehow figuring out which people to compare. The RN is a student who gets a stack of flash cards, one for each pair of people, and just has to answer "how do these two relate?" for each card. The second student's job is much easier — the structure of the comparison is given, and only the content needs to be learned.

Limitations

The RN has clear limitations too:

Quadratic scaling. The n² pair evaluations become expensive for large object sets. With 100 objects, that is 10,000 pairs. With 1,000, it is a million. Later work on efficient attention and sparse graph networks addresses this, but the original RN is inherently O(n²).
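The arithmetic behind the quadratic cost is easy to tabulate. This sketch counts ordered pairs and estimates the memory needed just to hold the concatenated pair inputs. The feature size `d=256` and the 4-byte float are illustrative assumptions, and `pair_cost` is a hypothetical helper.

```python
def pair_cost(n, d, bytes_per_value=4):
    """Return (number of ordered pairs, bytes to store all [o_i; o_j] inputs)."""
    pairs = n * n
    return pairs, pairs * 2 * d * bytes_per_value

for n in (10, 100, 1000):
    pairs, nbytes = pair_cost(n, d=256)
    print(f"n={n:>5}  pairs={pairs:>9,}  pair-input memory={nbytes / 1e6:,.1f} MB")
```

Every tenfold increase in n multiplies both numbers by one hundred, before counting the activations inside g_θ.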

Pairwise only. Some relations are inherently higher-order. "A is between B and C" is a ternary relation. The RN can approximate it through aggregation of binary relations, but cannot represent it directly. Stacking RN layers (applying an RN to the output of an RN) can help, but adds complexity.

No multi-step reasoning. The bAbI "path finding" failure reveals this clearly. "How do you get from the kitchen to the garden?" requires chaining: kitchen → hallway → garden. A single-pass RN sees all sentence pairs but cannot chain intermediate conclusions. Memory-augmented networks and multi-hop architectures handle this better.

Despite these limitations, the paper's contribution is profound. It showed that for a large class of reasoning problems — those that can be decomposed into pairwise comparisons — a simple module with the right structure massively outperforms sophisticated architectures with the wrong structure. The ceiling of the approach is clear (multi-hop, higher-order), but within its domain, it is extraordinarily effective.

Why does the RN use a shared g_θ for all pairs instead of separate networks per pair?

Because weight sharing forces g_θ to learn a general relation function that generalizes across all pairs, and it reduces the parameter count dramatically
Because separate networks would produce different outputs for the same pair
Because modern GPUs cannot run multiple networks simultaneously

Chapter 9: Connections

Relation Networks sit at a pivotal junction in deep learning history. They bridged the gap between symbolic AI (which reasons about relations explicitly) and neural networks (which learn from data). Several lines of work converge here and diverge from here.

Interaction Networks (Battaglia et al., 2016). Published the year before, Interaction Networks also model pairwise interactions between objects for physics simulation. Two of the RN paper's authors (Battaglia and Pascanu) were also on the Interaction Networks paper. RNs simplify the framework by dropping the explicit sender-receiver distinction and using a single MLP for all relations. The key RN contribution is showing that this simple formulation works across vision, language, and physics, not just physics alone.

Graph Neural Networks. The RN equation is equivalent to one step of message passing on a complete graph: each node sends a message to every other node, messages are aggregated by summation, and the graph-level readout produces the answer. This connection was later formalized by the MPNN framework (Gilmer et al., 2017) and the landmark "Relational inductive biases" paper (Battaglia et al., 2018), which unified RNs, GNNs, and Interaction Networks under one umbrella. The RN can be seen as the simplest possible GNN: one message passing step on a complete graph with no edge features.
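The equivalence with message passing can be verified on a toy example. The sketch assumes a hypothetical one-layer g and treats the graph as complete with self-loops, since the RN sum runs over all (i, j) including i = j. Summing g over all pairs directly gives the same vector as first aggregating incoming messages at each node and then summing over nodes.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, h = 5, 3, 7
W_g = rng.normal(size=(2 * d, h))  # hypothetical one-layer g
objects = rng.normal(size=(n, d))

def g(o_i, o_j):
    """Message from node i to node j: ReLU(W_g [o_i; o_j])."""
    return np.maximum(np.concatenate([o_i, o_j]) @ W_g, 0.0)

# RN view: one flat sum over every ordered pair.
rn_total = sum(g(objects[i], objects[j]) for i in range(n) for j in range(n))

# Message-passing view: each node j sums the messages it receives,
# then a sum readout pools the per-node results into a graph-level vector.
node_inbox = np.stack(
    [sum(g(objects[i], objects[j]) for i in range(n)) for j in range(n)]
)
graph_readout = node_inbox.sum(axis=0)

print(np.allclose(rn_total, graph_readout))
```

In both views the pooled vector would then be passed to the readout network to produce the answer.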

Attention mechanisms. Self-attention in Transformers (Vaswani et al., 2017, published the same year) also computes pairwise interactions between all elements of a set. The difference: attention uses dot-product similarity to compute scalar weights, then takes a weighted average of value vectors. The RN uses a full MLP on each pair, which can in principle approximate any function of the pair, not just similarity. Transformers would later show that the attention variant scales better to long sequences, but the core idea (consider all pairs) is deeply shared. In hindsight, Transformers and Relation Networks were two branches of the same tree, solving the same fundamental problem from different angles.
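The contrast shows up in the tensor shapes. The sketch below places a single-head scaled dot-product attention next to an RN-style pair function. All projection matrices are random and hypothetical, and g is reduced to one ReLU layer for brevity. Attention produces one scalar per pair; the RN produces a feature vector per pair.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, h = 4, 6, 5
x = rng.normal(size=(n, d))

# Self-attention: each pair (i, j) is summarized by ONE scalar score.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d)                               # shape (n, n)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)               # softmax over each row
attended = weights @ v                                      # weighted average of values

# Relation Network: each pair (i, j) gets a full feature VECTOR from g.
W_g = rng.normal(size=(2 * d, h))
pairs = np.concatenate([np.repeat(x, n, axis=0), np.tile(x, (n, 1))], axis=1)
relations = np.maximum(pairs @ W_g, 0.0).reshape(n, n, h)   # shape (n, n, h)
rn_pooled = relations.sum(axis=(0, 1))                      # sum over all pairs

print("attention pair tensor:", scores.shape)
print("RN pair tensor:       ", relations.shape)
```

Attention keeps one output per element, while the RN pools everything into a single vector for the whole set.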

Deep Sets (Zaheer et al., 2017). Also published the same year, Deep Sets proved that, under suitable conditions, a permutation-invariant function on sets can be decomposed as ρ(Σ φ(x_i)). RNs extend this form to pairwise interactions: ρ(Σ φ(x_i, x_j)). The two papers together established the theoretical foundations for set- and relation-processing neural networks.
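Written side by side, the two forms differ only in what gets embedded before pooling. The block below restates them in LaTeX. The label on the first function is chosen here for illustration. Note the symbol clash: φ is the per-element embedding in the Deep Sets form, but in the RN paper's notation it is the parameter subscript of the readout f_φ, so the RN's g_θ plays the role of φ and f_φ plays the role of ρ.

```latex
% Deep Sets form: each element is embedded on its own, then pooled.
f_{\text{DeepSets}}(X) = \rho\Big( \sum_{i} \phi(x_i) \Big)

% Relation Network form: each ordered pair is embedded, then pooled.
\mathrm{RN}(O) = f_{\phi}\Big( \sum_{i,j} g_{\theta}(o_i, o_j) \Big)
```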

Object-centric learning. The RN showed that you do not need explicit object detection to reason about objects: the CNN can learn to produce useful "object-like" representations when trained end-to-end with a relational module. This idea flowered into a line of work on object-centric representations: MONet, Slot Attention, and GENESIS all learn to decompose scenes into object slots, often paired with relational modules for downstream reasoning.

Modern large language models. It is worth noting that today's large language models (GPT-4, Claude, etc.) handle relational questions with ease — but they do so through massive scale and in-context learning, not through explicit relational structure. The RN paper's lesson remains relevant: for smaller, more efficient models, the right inductive bias can substitute for billions of parameters. In resource-constrained settings (robotics, edge devices), architectures with built-in relational reasoning may still outperform scaled-down general models.

The lasting impact. This paper has over 4,000 citations. Its influence is less about the specific RN module and more about the idea it crystallized: relational reasoning is a distinct capability that standard architectures lack, and it can be added through the right structural bias. Every modern architecture that processes sets of entities pairwise — from Transformers to graph networks to object-centric models — carries the DNA of this insight.

Paper details. "A simple neural network module for relational reasoning," Adam Santoro, David Raposo, David G.T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, Timothy Lillicrap. NeurIPS 2017. arXiv:1706.01427.


What is the key structural difference between Relation Networks and self-attention in Transformers?

RNs process pairs sequentially while Transformers process them in parallel
RNs use a full MLP (g_θ) on each pair, while Transformers use dot-product similarity and weighted averaging: both consider all pairs but with different interaction functions
Transformers can only process text while RNs can process images
