CNNs see pixels. LSTMs see sequences. But neither can answer "is the red ball left of the blue cube?" This paper introduced a devastatingly simple module (compare every pair of objects) that gave neural networks the ability to reason about relationships, reaching super-human accuracy on the CLEVR visual question answering benchmark.
Look at a table with a red ball, a blue cube, and a green cylinder. Now answer: "What shape is the object closest to the blue cube?"
You solved that instantly. You identified the blue cube, computed its distance to every other object, found the closest one, and reported its shape. Four operations, done in a flash. This is relational reasoning — reasoning about how entities relate to each other, not just what they are.
Now try to build a neural network that does the same thing. A convolutional neural network can recognize that there is a red ball in the image. It can tell you the ball is red. It can even localize the ball in the image. But ask "is the red ball left of the blue cube?" and the CNN struggles. Why?
Before this paper, the best approach to visual question answering on the CLEVR dataset — a benchmark specifically designed to test relational reasoning — scored 68.5%. Humans scored 92.6%. The gap was enormous, and it was entirely due to relational questions. On non-relational questions ("what color is the big sphere?"), models did reasonably well. On relational questions ("is the cube the same material as the cylinder?"), the best models performed barely above a baseline that just memorized answer frequencies. Researchers had thrown ResNets, VGGs, attention mechanisms, and huge fully connected layers at the problem. Nothing worked.
Santoro et al. proposed a module so simple it seems like it should not work: take every possible pair of objects, pass each pair through a small neural network, and sum the results. They called it a Relation Network (RN). It scored 95.5% on CLEVR — surpassing human performance.
The lesson of this paper is that architecture matters. Not more parameters, not more data, not a cleverer training trick. Just the right structural bias: consider all pairs. This is the same lesson that convolutions taught for spatial processing, and that recurrence taught for sequences. Each architectural innovation succeeds because it bakes in the right inductive bias for the problem at hand. For relational reasoning, the inductive bias is: compare everything to everything.
And the module is truly plug-and-play. The paper demonstrates it on three completely different domains — visual question answering, text-based question answering, and physical system reasoning — with the same core architecture. The only thing that changes is how "objects" are defined.
Before we write any equations, let us build intuition for what "relational reasoning" actually means. Consider three types of questions you could ask about a scene:
The first question is non-relational. You only need to look at one object. The second and third are relational. You must consider pairs of objects and reason about how they relate.
Here is the key insight: a relation is a property that emerges from considering two or more entities together. "Red" is a property of a single object. "Left of" is a property of a pair. "Between" is a property of a triple. You cannot determine "left of" by looking at either object alone — you need the positions of both.
Psychologists have studied relational reasoning for decades. It is considered a hallmark of human intelligence. Children develop relational thinking around age 3-4, and it is central to analogy, planning, and abstract thought. Standardized intelligence tests (like Raven's Progressive Matrices) are fundamentally tests of relational reasoning.
Yet before this paper, neural networks — even deep, powerful ones — had no built-in mechanism for relational reasoning. They had to learn it implicitly through massive parameter counts and lots of data. Usually, they failed. A CNN with millions of parameters, trained on millions of images, could not answer "is the red ball left of the blue cube?" The capacity was not the problem. The structure was.
The paper frames the problem cleanly: given a set of objects (whatever those may be — image regions, sentence encodings, physical entities), learn a function that reasons about their pairwise relations and aggregates those relations into an answer.
Notice two critical design choices in that framing. First, pairwise: the paper restricts attention to relations between pairs of objects, not triples or higher-order combinations. This is a simplification, but it turns out to be surprisingly powerful. Second, all pairs: the network does not try to guess which pairs matter. It considers every possible pair and lets the learning process figure out which relations are relevant.
Why not triples or higher? Pragmatism. For n objects, there are n² pairs but n³ triples. With even modest n, computing all triples becomes prohibitive — 5 objects yield 25 pairs but 125 triples; 10 objects yield 100 pairs but 1,000 triples.
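The growth rates above are easy to check by enumeration. Here is a minimal Python sketch that counts ordered pairs and triples, with repeats such as self-pairs included; the helper name `count_tuples` is made up for illustration.

```python
from itertools import product

def count_tuples(n, order):
    """Count ordered tuples of length `order` drawn from n objects (repeats allowed)."""
    return sum(1 for _ in product(range(n), repeat=order))

for n in (5, 10):
    print(n, count_tuples(n, 2), count_tuples(n, 3))
# prints:
# 5 25 125
# 10 100 1000
```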
And empirically, pairwise comparisons handle a surprising range of relational questions. "Left of" is inherently pairwise. "Between" involves three objects, but you can approximate it by combining two pairwise relations: "A is right of B" and "A is left of C" together imply "A is between B and C." The sum in the RN aggregates these pairwise signals, enabling implicit higher-order reasoning.
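As a rough illustration of how two pairwise relations can stand in for a ternary one, here is a hand-written sketch. The predicates `left_of`, `right_of`, and `between` are hypothetical helpers and the x values are illustrative. The explicit logical AND is only an analogy: the RN itself sums learned relation vectors and leaves the combination to fφ. Only x-coordinates are compared, so this is "between" along a single axis.

```python
def left_of(a, b):
    """True if object a has a smaller x-coordinate than object b."""
    return a["x"] < b["x"]

def right_of(a, b):
    """True if object a has a larger x-coordinate than object b."""
    return a["x"] > b["x"]

def between(a, b, c):
    """A is between B and C (along x) if it is right of B and left of C."""
    return right_of(a, b) and left_of(a, c)

ball = {"name": "red ball", "x": 2.1}
cube = {"name": "blue cube", "x": 5.7}
cylinder = {"name": "green cylinder", "x": 4.0}

print(between(cylinder, ball, cube))  # True
print(between(ball, cylinder, cube))  # False
```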
The entire Relation Network can be written in one line:

$$\mathrm{RN}(O) = f_\phi\Big(\sum_{i,j} g_\theta(o_i, o_j)\Big)$$
That is it. One equation. No attention heads, no memory banks, no program executors. Let us unpack every symbol.
O = {o1, o2, ..., on} is a set of objects. Each object oi is a vector — a list of numbers describing that object. For a CLEVR scene with 5 objects, O has 5 vectors.
gθ is a small MLP (multi-layer perceptron). It takes a pair of objects (oi, oj) — concatenated into one long vector — and outputs a vector that represents "how these two objects relate." The paper calls this output a relation. The parameters θ are learned during training.
Σi,j sums over all ordered pairs. For n objects, that is n² pairs (including self-pairs like (o1, o1)). The sum is the aggregation step — it collects all pairwise relations into a single vector.
fφ is another MLP that takes the aggregated relation vector and produces the final output (e.g., an answer to a question). Parameters φ are also learned.
Let us trace through a concrete example. Suppose we have 3 objects:
| Object | Features |
|---|---|
| o1 (red ball) | [1, 0, 0, 1, 0, 2.1, 3.4] |
| o2 (blue cube) | [0, 0, 1, 0, 1, 5.7, 1.2] |
| o3 (green cylinder) | [0, 1, 0, 0, 0, 4.0, 6.1] |
The RN forms all 9 pairs: (o1,o1), (o1,o2), (o1,o3), (o2,o1), (o2,o2), (o2,o3), (o3,o1), (o3,o2), (o3,o3). Each pair is concatenated and fed through gθ. The 9 output vectors are summed element-wise. The sum goes into fφ, which produces the answer.
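The trace above can be sketched in a few lines of plain Python. This is an untrained toy, not the paper's implementation: the weights are random, the layer widths (8 hidden units, 2 outputs) are arbitrary choices, and `make_mlp`, `mlp`, and `relation_network` are names invented here. The point is only the shape of the computation: 9 pairs, one shared gθ, an element-wise sum, then fφ.

```python
import random
from itertools import product

def make_mlp(sizes, rng):
    """Random (weights, biases) per layer for an MLP with the given layer widths."""
    return [
        ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

def mlp(params, x):
    """Forward pass with ReLU on hidden layers and a linear final layer."""
    for layer, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if layer < len(params) - 1:
            x = [max(0.0, v) for v in x]
    return x

def relation_network(objects, g_params, f_params):
    """Apply f to the element-wise sum of g over all ordered pairs."""
    relations = [mlp(g_params, o_i + o_j) for o_i, o_j in product(objects, repeat=2)]
    summed = [sum(column) for column in zip(*relations)]
    return mlp(f_params, summed), len(relations)

rng = random.Random(0)
objects = [
    [1, 0, 0, 1, 0, 2.1, 3.4],  # o1 (red ball)
    [0, 0, 1, 0, 1, 5.7, 1.2],  # o2 (blue cube)
    [0, 1, 0, 0, 0, 4.0, 6.1],  # o3 (green cylinder)
]
g_params = make_mlp([14, 8, 8], rng)  # input: two 7-dim objects concatenated
f_params = make_mlp([8, 8, 2], rng)   # 2 outputs, an arbitrary choice
output, n_pairs = relation_network(objects, g_params, f_params)
print(n_pairs, len(output))  # 9 2
```

Note that the same `g_params` are reused for every pair, which is why the function works unchanged for any number of objects.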
For n objects, we compute gθ exactly n² times. With 5 CLEVR objects, that is 25 calls. With the CNN-based pixel pipeline (where the "objects" are the d² cells of a d×d feature map), the number is much larger, but still tractable because gθ is small.
In graph theory terms, this is equivalent to operating on a complete directed graph: every object is a node, and there is an edge from every node to every other node (including self-loops). The function gθ is the edge function — it computes a vector for each edge. The sum aggregates all edge vectors into a graph-level representation, and fφ maps that to the output.
Could we use fewer pairs? Yes — if we know which objects are likely to interact, we could provide only those pairs. The paper acknowledges this: "this RN definition can be adjusted to consider only some object pairs." For instance, in a physics simulation you might know that only nearby objects interact, and skip distant pairs. But the all-pairs approach has a decisive advantage: it requires zero prior knowledge about the structure of the problem. The network discovers which relations exist through training. This is what makes it a general-purpose module.
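As a sketch of the "only some pairs" variant, the snippet below keeps just the ordered pairs whose objects lie within a fixed distance of each other. The distance rule, the `radius` value, and the helper name `nearby_pairs` are assumptions made for illustration; the paper does not prescribe a particular selection rule.

```python
import math
from itertools import product

def nearby_pairs(positions, radius):
    """Ordered index pairs (i, j) whose objects lie within `radius` of each other."""
    return [
        (i, j)
        for i, j in product(range(len(positions)), repeat=2)
        if math.dist(positions[i], positions[j]) <= radius
    ]

# Illustrative (x, y) positions; the fourth object is far from the rest.
positions = [(2.1, 3.4), (5.7, 1.2), (4.0, 6.1), (20.0, 20.0)]
pairs = nearby_pairs(positions, radius=5.0)
print(len(pairs), "of", len(positions) ** 2)  # 8 of 16
```

The RN sum would then run over `pairs` instead of all n² combinations, at the price of building in knowledge about which objects can interact.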
One subtlety: the sum includes self-pairs (oi, oi). At first glance this seems wasteful: what can an object learn from comparing itself to itself? But self-pairs allow gθ to extract per-object features (akin to a unary function), which can be useful. And because gθ is trained end-to-end, it is free to learn near-zero output for self-pairs when they are uninformative, so including them costs little beyond n extra evaluations.
The RN equation assumes we already have a set of objects. But images are not sets of objects — they are grids of pixels. How do we bridge the gap?
The paper's solution is elegant and perhaps surprising in its simplicity: run a standard CNN over the image, and treat every spatial location in the final feature map as an "object." No region proposals. No bounding boxes. No segmentation masks. Just grid cells.
This is deliberately agnostic about what constitutes an "object." A cell might represent background, part of a sphere, the edge of a cube, or a conjunction of multiple things. It might even represent the boundary between two objects. The network does not need to segment the scene into semantic objects first. It lets the CNN learn whatever representations are useful, and the RN learns to reason about their relationships.
This is counterintuitive. How can you reason about "objects" when most of your "objects" are just patches of background? The answer: gθ learns to produce near-zero output for uninformative pairs. A (background, background) pair contributes nothing to the sum. A (red ball region, blue cube region) pair contributes a lot. The model learns this automatically through training.
There is a cost: with a d×d feature map, we have d² "objects," and the RN evaluates d⁴ pairs. The paper uses relatively small feature maps (d≈8), giving about 64 objects and 4,096 pairs: expensive but feasible. Each gθ call is a small 4-layer MLP with 256 units per layer, so 4,096 forward passes through a small network is manageable on a GPU. The trick is that these passes can be batched: stack all 4,096 concatenated pair vectors into one matrix and run a single batched MLP forward pass.
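The bookkeeping is simple enough to sketch with nested lists. The feature values below are placeholder zeros rather than real CNN activations, and k = 24 channels is taken from the CNN configuration (24 kernels per layer); the snippet only shows how a d×d×k grid becomes d² objects and a d⁴-row batch of pair vectors.

```python
from itertools import product

d, k = 8, 24  # grid size and channel count
feature_map = [[[0.0] * k for _ in range(d)] for _ in range(d)]  # placeholder values

# Every grid cell becomes one "object" (row-major scan order).
objects = [feature_map[row][col] for row in range(d) for col in range(d)]

# One row per ordered pair: the two k-dim cells concatenated.
pair_batch = [o_i + o_j for o_i, o_j in product(objects, repeat=2)]

print(len(objects), len(pair_batch), len(pair_batch[0]))  # 64 4096 48
```

In a real implementation `pair_batch` would be a single tensor, so gθ runs once over all rows instead of 4,096 separate times.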
Notice how different this is from traditional object detection pipelines. Systems like Faster R-CNN first detect objects, draw bounding boxes, extract features, and then reason about them. The RN skips all of that. It does not know what an "object" is. It just takes every spatial cell and compares everything to everything. The CNN and end-to-end training together figure out what useful representations to put in those cells.
This "let the network figure out what objects are" philosophy was radical in 2017. Most visual reasoning pipelines at the time used a two-stage approach: first detect and segment objects, then reason about them. The RN showed that an end-to-end approach — where object representations emerge from the learning signal — could outperform hand-engineered pipelines. The implicit object representations learned by the CNN may not correspond to neat bounding boxes, but they contain the information the RN needs.
The paper also tests a second mode: state descriptions, where each object is explicitly described as a feature vector (3D coordinates, color, shape, material, size). This bypasses the CNN entirely and feeds objects directly to the RN. It achieves even higher accuracy (96.4%), confirming that the RN itself is the key ingredient.
The RN equation so far computes relations between objects. But in visual question answering, the meaning of a relation depends on the question. "Is the red ball left of the blue cube?" asks about spatial arrangement. "Is the red ball the same material as the blue cube?" asks about material properties. The RN must know which question is being asked.
The solution is to condition gθ on the question. The modified equation becomes:

$$\mathrm{RN}(O, q) = f_\phi\Big(\sum_{i,j} g_\theta(o_i, o_j, q)\Big)$$
where q is a question embedding — a vector that encodes the question's meaning. To produce q, the question's words are fed one at a time into an LSTM, and the final hidden state is used as the question embedding. Each word is first converted to a learned embedding vector via a lookup table.
Concretely, gθ now receives the concatenation [oi; oj; q] as input. This means every pair evaluation is informed by what the question is asking. If the question is about color, gθ can learn to focus on color features. If the question is about spatial position, gθ can focus on coordinates.
Why concatenation rather than something more sophisticated, like bilinear pooling or cross-attention? Because gθ is an MLP, and a sufficiently wide MLP is a universal function approximator. Given the concatenated input, it can in principle approximate any continuous function of the triple (oi, oj, q). A more complex combination scheme would add engineering complexity without adding expressiveness in principle. Simplicity wins.
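Building the conditioned input is a one-liner. In the sketch below the question embedding `q` is a made-up 4-dimensional vector (the real one would come from the LSTM), and `pair_inputs` is a hypothetical helper name. Each row is what gθ would receive for one pair.

```python
from itertools import product

def pair_inputs(objects, q):
    """One row [o_i; o_j; q] per ordered pair; the same q is appended to every row."""
    return [o_i + o_j + q for o_i, o_j in product(objects, repeat=2)]

objects = [
    [1, 0, 0, 1, 0, 2.1, 3.4],  # red ball
    [0, 0, 1, 0, 1, 5.7, 1.2],  # blue cube
    [0, 1, 0, 0, 0, 4.0, 6.1],  # green cylinder
]
q = [0.3, -1.2, 0.8, 0.0]  # illustrative question embedding

rows = pair_inputs(objects, q)
print(len(rows), len(rows[0]))  # 9 18
```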
The full pipeline for visual QA:
The entire system — CNN, LSTM, and RN — is trained end-to-end with cross-entropy loss. This means the CNN learns to produce feature maps that are useful for relational reasoning, not just object recognition. The LSTM learns embeddings that tell the RN what to look for. Everything adapts together.
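For readers who want the training signal spelled out, here is a minimal sketch of softmax and cross-entropy over answer logits. The function names are chosen here for illustration, and the 29-way output matches the answer vocabulary size in the configuration; gradient computation and the rest of the training loop are omitted.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct answer."""
    return -math.log(softmax(logits)[target_index])

logits = [0.0] * 29  # a model with no preference among 29 answers
loss = cross_entropy(logits, target_index=3)
print(round(loss, 3))  # 3.367, i.e. ln(29)
```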
The model configuration is remarkably modest compared to the visual QA architectures of the day:
| Component | Configuration |
|---|---|
| CNN | 4 conv layers, 24 kernels each, ReLU + batch norm |
| Question LSTM | 128 units, 32-dim word embeddings |
| gθ (relation) | 4-layer MLP, 256 units per layer, ReLU |
| fφ (output) | 3-layer MLP: 256, 256 (50% dropout), 29 units |
| Output | Softmax over 29-word answer vocabulary |
| Optimizer | Adam, lr = 2.5 × 10⁻⁴ |
No ResNet-101. No VGG. No stacked attention modules. No 4,000-unit fully connected layers. The simplicity of the architecture is the point: the improvement comes from the structure of the computation (consider all pairs), not from scale.
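To put a number on "modest", the sketch below counts weights and biases in gθ and fφ from the layer widths in the table. One assumption is labeled in the code: each object is taken to be just the 24 channel values of a grid cell, so the gθ input is 24 + 24 + 128 wide; any extra per-object features would increase the first-layer count slightly. `mlp_param_count` is a helper name invented here.

```python
def mlp_param_count(sizes):
    """Weights plus biases for fully connected layers with the given widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

object_dim = 24     # assumption: one object = the 24 channel values of a grid cell
question_dim = 128  # LSTM hidden size

g_sizes = [2 * object_dim + question_dim, 256, 256, 256, 256]
f_sizes = [256, 256, 256, 29]

print(mlp_param_count(g_sizes))  # 242688
print(mlp_param_count(f_sizes))  # 139037
```

Under these assumptions the relational part of the model has well under half a million parameters.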
The paper's primary battleground is CLEVR — a dataset of rendered 3D scenes with objects of different shapes, sizes, colors, and materials. Each scene comes with questions that test different reasoning abilities.
| Question Type | Example | Relational? |
|---|---|---|
| Query attribute | "What color is the large sphere?" | No |
| Exist | "Is there a small red cube?" | Sometimes |
| Count | "How many objects are the same shape as the blue thing?" | Yes |
| Compare attribute | "Is the cube the same material as the cylinder?" | Yes |
| Compare numbers | "Are there more cubes than cylinders?" | Yes |
The results are striking:
| Model | Overall | Count | Compare Attr. |
|---|---|---|---|
| Human | 92.6% | 86.7% | 96.0% |
| CNN+LSTM+SA (best prior) | 68.5% | 52.2% | 52.3% |
| CNN+LSTM+RN | 95.5% | 90.1% | 97.1% |
The RN does not just improve overall accuracy — it demolishes the gap on relational question types. On "compare attribute" questions, the previous best (stacked attention) scored 52.3%, barely above the baseline that guesses based on question type alone. The RN scores 97.1%. That is a 45 percentage point jump on the hardest category.
Look at the pattern. On "query attribute" (non-relational, not shown in the table above), the stacked attention model already scores 85.3%, which is decent, because this only requires recognizing a single object. On "compare attribute" and "count" (relational), it collapses to ~52%. The RN scores above 90% on every category. The diagnosis is clear: the bottleneck was never vision or language processing. It was always relational reasoning.
Even more striking: the RN outperforms humans at 95.5% vs. 92.6%. This is not because the RN is smarter than humans. It is because humans make careless errors on tedious counting questions ("how many cubes are left of the red sphere?"), while the RN systematically evaluates every pair. The machine's advantage is exhaustiveness, not intelligence.
The authors also tested with state descriptions instead of pixels — feeding object features (position, color, shape, material, size) directly into the RN without a CNN. This scored 96.4% overall, even higher than the pixel version. The message: when the bottleneck is relational reasoning, giving the model cleaner object representations helps. But even with messy CNN features, the RN still surpasses humans.
One more detail worth noting: the paper compared against a concurrent approach that used ground-truth functional programs as additional supervision (essentially telling the model the reasoning steps). Even that approach, with its privileged training signal, only reached 96.9%. The RN reached 95.5% with no program supervision — just images, questions, and answers. This is a strong result: the right architecture nearly matches explicit program supervision.
The beauty of the RN is its generality. It does not care what the "objects" are. The same equation works for images, text, and physical systems. The paper demonstrates this across two additional domains.
The bAbI suite contains 20 text reasoning tasks. Each task provides a set of supporting sentences and a question. For example:
For text, the paper defines "objects" as follows: each sentence is processed by an LSTM, and the LSTM's final hidden state becomes one object. The sentences are tagged with their position in the support set (1st, 2nd, 3rd, ...) so the RN has temporal ordering information.
A separate LSTM encodes the question into an embedding q. Then the standard RN processes all sentence pairs, conditioned on q.
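The object construction for text can be sketched as follows. Two things here are stand-ins and not the paper's method: `encode_sentence` is a toy bag-of-characters encoder used in place of the LSTM, and the three sentences are invented bAbI-style examples. The part that mirrors the description above is `tag_with_position`, which appends each sentence's position in the support set to its encoding.

```python
def encode_sentence(sentence, dim=4):
    """Toy stand-in for the LSTM encoder: a fixed-size bag-of-characters vector."""
    vec = [0.0] * dim
    for ch in sentence.lower():
        if ch.isalpha():
            vec[ord(ch) % dim] += 1.0
    return vec

def tag_with_position(encodings):
    """Append the 1-based position in the support set to each sentence encoding."""
    return [enc + [float(pos)] for pos, enc in enumerate(encodings, start=1)]

support = [
    "Mary went to the kitchen.",
    "John picked up the football.",
    "John went to the garden.",
]
objects = tag_with_position([encode_sentence(s) for s in support])
print(len(objects), len(objects[0]), [o[-1] for o in objects])  # 3 5 [1.0, 2.0, 3.0]
```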
The RN-augmented model solved 18 out of 20 bAbI tasks (using the 95% accuracy threshold), matching the Differentiable Neural Computer (DNC) under joint training with 10K examples per task. Notably, it passed "basic induction" (inferring a general rule from examples), a task on which the DNC failed. It fell short on only two tasks: "two supporting facts" and "three supporting facts," missing the threshold by 3.1 and 11.5 percentage points respectively. Both demand something a single-pass RN is not built for: chaining several facts together to reach the answer.
Note the minimal prior knowledge: the paper delineates objects as sentences. Previous bAbI models processed all words from all sentences in one long sequence. The sentence-as-object choice is a form of prior knowledge, but a very mild one — periods already delineate sentences, so the sequential models could in principle learn the same decomposition.
The third domain is the most visually striking. Ten colored balls move on a table. Some pairs are connected by invisible springs or rigid constraints. The task: infer which balls are connected, or count the number of independent systems.
Here, each object is a ball described by its color (RGB) and spatial coordinates (x, y) across 16 time steps. The RN considers all pairs of balls and must learn to detect correlated motion — if two balls maintain a consistent distance over time, they are probably connected.
For the physics tasks, the RN achieved 93.6% accuracy on connection inference and competitive accuracy on counting systems. The failures were interpretable: the model struggled when balls happened to move in similar directions without being connected (coincidental correlation).
The physics task is particularly interesting because it requires reasoning about temporal relations. Two balls maintaining a constant distance over 16 frames is evidence of a spring. Two balls accelerating toward each other is evidence of attraction. The RN handles this by treating the temporal trajectory as part of each object's feature vector — each ball's (x, y) coordinates across all 16 frames are concatenated into one long vector. Then gθ compares two such trajectories to detect correlated motion.
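A small sketch makes the object encoding concrete. `ball_object` flattens a color and 16 (x, y) positions into one vector, as described above. `distance_spread` is a hand-written statistic, not something the RN computes explicitly; it is included only to show the kind of signal gθ would have to discover from a pair of trajectories. The trajectories are synthetic.

```python
import math

def ball_object(rgb, trajectory):
    """Flatten an RGB color and a list of (x, y) positions into one feature vector."""
    return list(rgb) + [coord for xy in trajectory for coord in xy]

def distance_spread(traj_a, traj_b):
    """Max minus min distance between two balls over time (0 for a rigid link)."""
    dists = [math.dist(a, b) for a, b in zip(traj_a, traj_b)]
    return max(dists) - min(dists)

T = 16
a = [(float(t), 0.0) for t in range(T)]  # moves right
b = [(float(t), 1.0) for t in range(T)]  # moves with a, always 1 unit away
c = [(0.0, float(t)) for t in range(T)]  # moves up, unrelated to a

obj_a = ball_object((1.0, 0.0, 0.0), a)
print(len(obj_a))             # 35 = 3 color values + 16 * 2 coordinates
print(distance_spread(a, b))  # 0.0
print(distance_spread(a, c) > 1.0)  # True
```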
This is a beautiful example of the RN's flexibility. For CLEVR, objects are CNN feature vectors. For bAbI, objects are LSTM hidden states. For physics, objects are spatiotemporal trajectories. The RN does not care. Its job is the same in every case: compare pairs and aggregate.
The counting task (how many independent systems?) is especially hard because it requires global reasoning. The RN must first determine all connections, then implicitly count connected components. A single pairwise pass can detect individual connections but counting components requires aggregating that information coherently. The RN manages this through the summation and fφ, but it is near the limits of what a single-pass pairwise architecture can do.
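To see why counting is a global problem, compare it with the explicit algorithm one would write if the connections were already known. The union-find sketch below is that classical algorithm, shown for contrast; the RN has no such procedure and must approximate its result through the sum and fφ. The function name and the example connections are made up.

```python
def count_systems(n_balls, connections):
    """Count connected components given a list of (a, b) connections (union-find)."""
    parent = list(range(n_balls))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in connections:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n_balls)})

# Balls 0-1-2 form one system, 3-4 another, and the remaining five are free.
print(count_systems(10, [(0, 1), (1, 2), (3, 4)]))  # 7
```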
[Interactive demo: a simple scene with colored shapes and four selectable questions. For the chosen question, the demo shows gθ evaluating every pair of objects and highlights the pairs with strong responses.]
The grid at the bottom shows the output magnitude of gθ(oi, oj, q) for every pair. Rows are oi, columns are oj. The diagonal (self-pairs) is grayed out. Bright cells indicate strong relational signals — these are the pairs that contribute most to the final answer.
Try each question and notice how different questions activate different pairs. Q1 ("left of the blue circle") lights up the (Red, Blue) pair because the red circle is the one spatially left of blue. Q2 ("closest to the green square") lights up (Blue, Green) because the blue circle is nearest. Q3 ("same shape as the red circle") lights up all pairs involving red, because the RN must compare red to every other object. Q4 ("right of the yellow triangle") lights up all pairs involving yellow, checking each object's position relative to yellow.
This is the RN's implicit attention mechanism. Without any explicit attention module, the MLP gθ has learned to produce near-zero outputs for irrelevant pairs. The question embedding q acts as a steering signal: it tells gθ what kind of relation to look for, and gθ suppresses everything else.
The Relation Network's power comes from three structural properties that align perfectly with the nature of relational reasoning.
By evaluating all n² pairs, the RN guarantees that no relevant relationship is missed. An MLP processing all objects as a flat vector must learn to decompose the input into pairs internally — a much harder task. The RN gets this decomposition for free from its architecture.
Think of it like this: an MLP is given a bag of puzzle pieces and must learn to compare them. It must learn how to pick up two pieces, which two to pick up, and what to check about them — all within its weight matrix. The RN is given a systematic layout of every possible pair comparison and just has to learn what to look for in each comparison.
The paper demonstrates this empirically with Sort-of-CLEVR, a simplified 2D dataset the authors built to separate relational from non-relational questions. A CNN+MLP with enough parameters to theoretically represent the relation function still reaches only ~63% on relational questions. The CNN+RN with fewer total parameters succeeds at ~94%. The architecture, not the capacity, makes the difference.
A single gθ processes all pairs. This is analogous to how a CNN shares a single filter across all spatial locations. The filter learns a general "what is at this location?" function; gθ learns a general "how do these two objects relate?" function.
Weight sharing provides two benefits. First, it reduces the number of parameters. Without sharing, we would need n² separate MLPs (one per pair position), which is both wasteful and impossible when n varies. Second, it forces generalization. Because gθ must work for any pair, it cannot memorize pair-specific idiosyncrasies. It must learn genuinely general relational features like "these two have similar color" or "this one is to the left of that one."
The summation in the RN equation is commutative. This means the output is invariant to the ordering of objects in O. If you shuffle the objects, the same pairs are evaluated (in a different order), the same values are produced, and the same sum results. This is the correct inductive bias: the answer to a visual question should not depend on the arbitrary order in which we list the objects.
This is not just a mathematical nicety — it is essential for correctness. If you extracted objects from an image using a CNN, the order depends on how you scan the feature map (left-to-right, top-to-bottom). A different scanning order should not change the answer. The summation guarantees this.
Compare this with an MLP that takes all objects concatenated into one flat vector. Swapping two objects in the input changes the flat vector, and the MLP may produce a completely different output. The MLP would need to learn separately that [o1; o2] and [o2; o1] should give the same answer — using up capacity on a trivial symmetry.
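The contrast can be checked directly. In the sketch below, `g` is a hand-written stand-in for gθ (any fixed function of a pair would do), and `flat_model` is a single linear layer over the concatenated objects, standing in for the first layer of a flat MLP. Integer features are used so the comparison is exact.

```python
from itertools import permutations, product

def g(o_i, o_j):
    """Hand-written stand-in for g_theta: any fixed function of a pair works here."""
    return [abs(a - b) for a, b in zip(o_i, o_j)] + [a * b for a, b in zip(o_i, o_j)]

def pairwise_sum(objects):
    """Element-wise sum of g over all ordered pairs."""
    relations = [g(o_i, o_j) for o_i, o_j in product(objects, repeat=2)]
    return [sum(column) for column in zip(*relations)]

def flat_model(objects, weights):
    """Linear model on the concatenated objects; each weight is tied to a position."""
    flat = [v for obj in objects for v in obj]
    return sum(w * v for w, v in zip(weights, flat))

objects = [[1, 0, 2], [0, 1, 5], [0, 0, 4]]
weights = list(range(1, 10))

reference = pairwise_sum(objects)
print(all(pairwise_sum(list(p)) == reference for p in permutations(objects)))  # True

swapped = [objects[1], objects[0], objects[2]]
print(flat_model(objects, weights), flat_model(swapped, weights))  # 78 69
```

All six orderings give the same pairwise sum, while swapping two objects changes the flat model's output.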
The RN has clear limitations too:
Quadratic scaling. The n² pair evaluations become expensive for large object sets. With 100 objects, that is 10,000 pairs. With 1,000, it is a million. Later work on efficient attention and sparse graph networks addresses this, but the original RN is inherently O(n²).
Pairwise only. Some relations are inherently higher-order. "A is between B and C" is a ternary relation. The RN can approximate it through aggregation of binary relations, but cannot represent it directly. Stacking RN layers (applying an RN to the output of an RN) can help, but adds complexity.
No multi-step reasoning. The two bAbI failures ("two supporting facts" and "three supporting facts") point to this. Answering "where is the football?" can require chaining: John picked up the football, then John went to the garden, so the football is in the garden. A single-pass RN sees all sentence pairs but cannot feed an intermediate conclusion into a further comparison. Memory-augmented networks and multi-hop architectures handle this better.
Despite these limitations, the paper's contribution is profound. It showed that for a large class of reasoning problems — those that can be decomposed into pairwise comparisons — a simple module with the right structure massively outperforms sophisticated architectures with the wrong structure. The ceiling of the approach is clear (multi-hop, higher-order), but within its domain, it is extraordinarily effective.
Relation Networks sit at a pivotal junction in deep learning history. They bridged the gap between symbolic AI (which reasons about relations explicitly) and neural networks (which learn from data). Several lines of work converge here and diverge from here.
Interaction Networks (Battaglia et al., 2016). Published the year before, Interaction Networks also model pairwise interactions between objects for physics simulation. Two of the RN paper's authors (Battaglia and Pascanu) were also on the Interaction Networks paper. RNs simplify the framework by dropping the explicit sender-receiver distinction and using a single MLP for all relations. The key RN contribution is showing that this simple formulation works across vision, language, and physics, not just physics alone.
Graph Neural Networks. The RN equation is equivalent to one step of message passing on a complete graph: each node sends a message to every other node, messages are aggregated by summation, and the graph-level readout produces the answer. This connection was later formalized by the MPNN framework (Gilmer et al., 2017) and the landmark "Relational inductive biases" paper (Battaglia et al., 2018), which unified RNs, GNNs, and Interaction Networks under one umbrella. The RN can be seen as the simplest possible GNN: one message passing step on a complete graph with no edge features.
Attention mechanisms. Self-attention in Transformers (Vaswani et al., 2017, published the same year) also computes pairwise interactions between all elements of a set. The difference: attention uses dot-product similarity to compute scalar weights, then takes a weighted average of value vectors. The RN uses a full MLP on each pair, which can compute any function of the pair, not just similarity. Transformers would later prove that the attention variant scales better to long sequences, but the core idea — consider all pairs — is deeply shared. In hindsight, Transformers and Relation Networks were two branches of the same tree, published the same year, solving the same fundamental problem from different angles.
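In symbols, a rough side-by-side. The attention line uses the standard scaled dot-product notation (per-element queries, keys, and values, written here as $\mathbf{q}_i$, $\mathbf{k}_j$, $\mathbf{v}_j$, with key width $d_k$), which is not notation from the RN paper; the bold $\mathbf{q}_i$ is unrelated to the question embedding q used earlier.

```latex
\text{Self-attention:}\quad
  z_i = \sum_j \operatorname{softmax}_j\!\left(\frac{\mathbf{q}_i^\top \mathbf{k}_j}{\sqrt{d_k}}\right) \mathbf{v}_j
\qquad
\text{Relation Network:}\quad
  \mathrm{RN}(O) = f_\phi\!\Big(\sum_{i,j} g_\theta(o_i, o_j)\Big)
```

Attention produces one output per element and reduces each pair to a scalar weight; the RN produces one output for the whole set and lets each pair contribute a full vector.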
Deep Sets (Zaheer et al., 2017). Also published the same year, Deep Sets proved that, under mild conditions, any permutation-invariant function on sets can be decomposed as ρ(Σ φ(xi)). RNs extend this form to pairwise interactions: ρ(Σ φ(xi, xj)). The two papers together established the theoretical foundations for set- and relation-processing neural networks.
Object-centric learning. The RN showed that you do not need explicit object detection to reason about objects — the CNN can learn to produce useful "object-like" representations when trained end-to-end with a relational module. This idea flowered into a line of work on object-centric representations: MONET, Slot Attention, and GENESIS all learn to decompose scenes into object slots, often paired with relational modules for downstream reasoning.
Modern large language models. It is worth noting that today's large language models (GPT-4, Claude, etc.) handle relational questions with ease — but they do so through massive scale and in-context learning, not through explicit relational structure. The RN paper's lesson remains relevant: for smaller, more efficient models, the right inductive bias can substitute for billions of parameters. In resource-constrained settings (robotics, edge devices), architectures with built-in relational reasoning may still outperform scaled-down general models.
Paper details. "A simple neural network module for relational reasoning," Adam Santoro, David Raposo, David G.T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, Timothy Lillicrap. NeurIPS 2017. arXiv:1706.01427.