Order Matters: Seq2Seq for Sets

Chapter 0: The Problem

You have five numbers: {3, 1, 4, 1, 5}. Your job is to sort them. Simple for a human, but how do you feed them to a neural network?

You could line them up left to right: 3, 1, 4, 1, 5. Feed them into an LSTM encoder, then decode the sorted output: 1, 1, 3, 4, 5. This is the sequence-to-sequence approach, and it works. But there is something deeply wrong with it.

The input is a set. The numbers {3, 1, 4, 1, 5} and {5, 4, 3, 1, 1} and {1, 3, 1, 5, 4} are all the same input. The answer should be identical regardless of how you arrange them. But an LSTM reads left to right. It sees 3-then-1-then-4 as a fundamentally different sequence from 5-then-4-then-3. You are forcing an order onto data that has none.

The central tension: Seq2seq models are built for sequences. But many real problems involve sets — unordered collections where any arrangement is equally valid. Forcing an arbitrary order onto a set injects a false assumption into your model. This paper asks: does that assumption hurt? (Yes, dramatically.) And what can we do about it?

The problem cuts both ways. Sometimes the input is a set (like numbers to sort, or objects detected in an image). Sometimes the output is a set (like predicting which objects are in a scene, where the order of your predictions shouldn't matter). Sometimes both are sets.

Vinyals, Bengio and Kudlur tackle both sides. For input sets, they propose the Read-Process-and-Write architecture, which uses attention to build an order-invariant encoding. For output sets, they propose searching over orderings during training, letting the model discover the best order on its own.

Why is it problematic to feed a set of numbers into an LSTM in some arbitrary order?

(a) Because LSTMs can only handle fixed-length inputs
(b) Because the LSTM treats the input as a sequence with meaningful order, injecting a false assumption about structure that doesn't exist
(c) Because LSTMs cannot process numerical data

Chapter 1: The Chain Rule Trap

Seq2seq models work by decomposing the output probability using the chain rule:

P(Y|X) = ∏_{t=1}^{T} P(y_t | y_1, ..., y_{t-1}, X)

This is mathematically exact. No approximations, no independence assumptions. You just predict one token at a time, each conditioned on everything before it. An LSTM implements this naturally: at each step, its hidden state summarizes all previous outputs, and a softmax predicts the next one.

But here is the catch. The chain rule works in any order. For three random variables, all of these are equally valid:

P(a, b, c) = P(a) · P(b|a) · P(c|a,b)
= P(c) · P(b|c) · P(a|b,c)
= P(b) · P(c|b) · P(a|b,c)

In theory, a sufficiently powerful model should learn the correct joint distribution regardless of which factorization you choose. In practice, it does not. The order you pick determines which conditional distributions the model must learn. Some conditionals are easy. Others are brutally hard.
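The equivalence of the three factorizations can be checked numerically. The sketch below uses a made-up joint distribution over three binary variables (the weights are arbitrary illustrative values, not from the paper) and multiplies out the conditionals in each of the three orders shown above.

```python
# Hypothetical joint distribution over three binary variables (a, b, c).
# The weights are arbitrary illustrative values, normalized to sum to 1.
weights = {(0, 0, 0): 1, (0, 0, 1): 2, (0, 1, 0): 3, (0, 1, 1): 4,
           (1, 0, 0): 5, (1, 0, 1): 6, (1, 1, 0): 7, (1, 1, 1): 8}
total = sum(weights.values())
joint = {outcome: w / total for outcome, w in weights.items()}

def marginal(assignment):
    """Probability that the variables in `assignment` (index -> value) take those values."""
    return sum(p for outcome, p in joint.items()
               if all(outcome[i] == v for i, v in assignment.items()))

def chain_rule(outcome, order):
    """Multiply the conditionals P(x_k | earlier variables) in the given variable order."""
    prob = 1.0
    seen = {}
    for idx in order:
        before = marginal(seen)      # probability of everything conditioned on so far
        seen[idx] = outcome[idx]
        prob *= marginal(seen) / before
    return prob

# Variable orders matching the three factorizations: (a,b,c), (c,b,a), (b,c,a).
orders = [(0, 1, 2), (2, 1, 0), (1, 2, 0)]
max_gap = max(abs(chain_rule(outcome, order) - joint[outcome])
              for outcome in joint for order in orders)
```

Every order reproduces the joint up to floating-point error. What differs between orders is the set of conditionals a model would have to learn, not the product they multiply out to.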

The key insight: The chain rule is exact in principle but approximate in practice. Your model has finite capacity and finite training data. Some factorization orders produce conditionals that are much easier for an LSTM to learn than others. The "wrong" order can cost you as much as 10 points of accuracy.

Think about language. English sentences flow left to right, and each word is strongly predicted by the words before it. "The cat sat on the ___" strongly constrains the next word. But what if you shuffled the words, say into "on The the sat cat"? Predicting "cat" from "on The the sat" is much harder because the conditioning context is unnatural: words that were neighbors in the sentence are no longer neighbors in the sequence. (A clean reversal keeps neighbors adjacent, and Chapter 6 shows it costs nothing in perplexity.)

For sequences like English, there is a natural order. But for sets, there is no natural order. Every order you pick is arbitrary. And as we will see, some arbitrary choices are catastrophically worse than others.

The chain rule decomposition of a joint probability is mathematically exact regardless of variable ordering. Why does ordering still matter in practice?

(a) Because the chain rule only works for sequences, not sets
(b) Because the model has finite capacity and some orderings produce conditionals that are much harder to learn than others
(c) Because the chain rule introduces independence assumptions

Chapter 2: Input Order Matters

Before proposing a solution, the paper first demonstrates the problem with hard evidence. Consider three cases where changing the input order changed everything:

Machine translation. Sutskever et al. (2014) found that reversing the input English sentence before translating to French improved BLEU score by 5 points. Just reading "cats like I" instead of "I like cats" — same words, reversed order — produced dramatically better translations. Why? Because reversing puts the first English words closest to the first French words in the decoder, reducing the effective distance the LSTM must bridge.

Constituency parsing. When parsing English sentences into syntax trees, reversing the input sentence improved F1 by 0.5% absolute. Again, just flipping the reading order.

Convex hull computation. When computing the convex hull of a set of 2D points, sorting the input points by angle before feeding them to the model increased accuracy by up to 10% absolute. Sorting reduces the task from O(n log n) complexity to O(n), making it much simpler for the network.

The pattern: In every case, the model sees the exact same data. The only thing that changes is the order. And the results swing wildly. This tells us something deep: order is not just a formatting choice. It is a prior on the problem structure. The right prior makes learning easy. The wrong prior makes it hard.

For naturally sequential data like language, we at least have a reasonable default order (left to right). But what about sets? If you have 15 numbers to sort, which of the 15! ≈ 1.3 trillion possible input orderings should you pick? The answer is: none of them. You need an architecture that does not care about order at all.

[Interactive demo: Input Order & Model Output. Reordering the items of a set shows a simulated seq2seq encoder producing a different final hidden state vector, drawn as a bar chart, for each ordering of the same set.]

The simulation above illustrates the core problem. A sequential encoder (like an LSTM) produces a different internal representation for every ordering of the same set. This means the decoder sees a different "summary" of the input depending on an arbitrary choice that has nothing to do with the task. The model must waste capacity learning to be invariant to these spurious differences.
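The same effect shows up in a few lines of code. This is a minimal sketch, assuming a one-unit tanh recurrence with arbitrary untrained weights as a stand-in for an LSTM encoder (`toy_rnn_encode` and its weights are hypothetical), compared against a plain sum.

```python
import math
from itertools import permutations

def toy_rnn_encode(values, w_in=0.1, w_rec=0.5):
    """Final hidden state of a one-unit tanh RNN; the weights are arbitrary, not trained."""
    h = 0.0
    for x in values:
        h = math.tanh(w_in * x + w_rec * h)
    return h

def sum_encode(values):
    """Order-invariant baseline: add the items up."""
    return sum(values)

items = [3, 1, 4, 1, 5]
# Collect the distinct encodings produced across all orderings of the same items.
rnn_states = {round(toy_rnn_encode(p), 6) for p in permutations(items)}
sum_states = {sum_encode(p) for p in permutations(items)}
```

`rnn_states` contains many different values for the same collection of numbers, while `sum_states` contains exactly one. A downstream decoder fed the recurrent state has to cope with that spread.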

Reversing the input English sentence before translation improved BLEU by 5 points. What does this suggest about seq2seq models?

(a) The order in which data is presented to the encoder significantly affects learning, even when the data content is identical
(b) Reversed English is closer to French grammar
(c) LSTMs work better when reading backwards

Chapter 3: Attention as Memory

How do we build an encoder that genuinely does not care about input order? The simplest approach is the bag of words: just add up the embeddings of all input elements. Adding is commutative, so {A, B, C} and {C, A, B} produce the same sum. Problem solved?

Not quite. Addition throws away structural information. If the embedding is linear in the value, for example, {3, 7} and {5, 5} sum to the same vector, so the model cannot tell them apart. Worse, the representation has a fixed dimensionality regardless of how many items are in the set. You are trying to cram an arbitrarily large set into a single fixed-size vector. For large sets, information is inevitably lost.
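A small sketch makes the collision concrete. Both embedding functions here are hypothetical choices for illustration: `linear_embed` scales a fixed direction vector by the value, and `nonlinear_embed` uses powers of the value.

```python
def linear_embed(x, direction=(0.5, -1.0, 2.0)):
    """Hypothetical linear embedding: the value scales a fixed direction vector."""
    return tuple(x * d for d in direction)

def nonlinear_embed(x):
    """Hypothetical nonlinear embedding, chosen only to break the collision."""
    return (x, x * x, x ** 3)

def sum_pool(values, embed):
    """Bag-of-words encoding: embed each item, then add the vectors component-wise."""
    vectors = [embed(v) for v in values]
    return tuple(sum(component) for component in zip(*vectors))

order_invariant = sum_pool([3, 7], linear_embed) == sum_pool([7, 3], linear_embed)
collide = sum_pool([3, 7], linear_embed) == sum_pool([5, 5], linear_embed)
separate = sum_pool([3, 7], nonlinear_embed) != sum_pool([5, 5], nonlinear_embed)
```

The sum is order-invariant in both cases, but with the linear embedding two different collections land on the same vector. A nonlinear embedding separates this particular pair, yet the fixed output size still caps how much a single sum can hold.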

What we need: A representation that (1) is invariant to input order, (2) can scale its effective memory with set size, and (3) can capture interactions between set elements. Simple aggregation fails on all three counts for complex tasks.

The solution is content-based attention. Instead of crushing all inputs into a single vector, we keep them all around as a memory bank. When the model needs information, it queries this memory and retrieves a weighted combination of the stored items. The key property: the attention mechanism computes a weighted sum over memory slots, and a weighted sum is invariant to the order of the slots.

Here is how it works. We have memory vectors m₁, ..., m_n (one per input element) and a query vector q. The attention mechanism computes:

e_i = f(m_i, q)     (score each memory)
a_i = exp(e_i) / ∑_j exp(e_j)     (softmax to get weights)
r = ∑_i a_i m_i     (weighted readout)

Where f is a scoring function (e.g., a dot product). The readout r is a weighted average of the memories. If you permute the memories — swap m_i and m_j — the weights a_i and a_j also swap, and the sum r stays the same. Order invariance is guaranteed by the mathematics of weighted summation.
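The three equations translate almost line for line into code. The sketch below assumes a dot product for the scoring function f and uses made-up memory and query vectors; it then checks the readout against every permutation of the memory slots.

```python
import math
from itertools import permutations

def attention_readout(memories, query):
    """Dot-product attention: score each memory, softmax, weighted sum."""
    scores = [sum(m_k * q_k for m_k, q_k in zip(m, query)) for m in memories]
    peak = max(scores)                              # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(memories[0])
    return [sum(w * m[k] for w, m in zip(weights, memories)) for k in range(dim)]

# Illustrative values only.
memories = [[1.0, 0.0], [0.0, 2.0], [1.5, -0.5]]
query = [0.3, 0.8]
reference = attention_readout(memories, query)
max_gap = max(abs(a - b)
              for perm in permutations(memories)
              for a, b in zip(attention_readout(list(perm), query), reference))
```

`max_gap` stays at floating-point noise: reordering the slots reorders the weights in lockstep, so the weighted sum does not move.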

But a single attention readout may not capture everything. What if the model needs to reason about relationships between items? That requires multiple rounds of attention — reading the memory repeatedly, each time with a different query that has been updated by what was read previously.

Why is content-based attention naturally order-invariant?

(a) Because it uses an LSTM to read the inputs
(b) Because it only looks at one input at a time
(c) Because the weighted sum over memory slots produces the same result regardless of the order of the slots

Chapter 4: Read-Process-Write

The paper's main architectural contribution is the Read-Process-and-Write model. It has three stages:

Read

Embed each input element x_i into a memory vector m_i using a shared neural network. Each element gets its own slot.

↓

Process

An LSTM with no external input runs for T steps, repeatedly attending to the memory vectors. Each step refines its understanding of the set.

↓

Write

A decoder LSTM produces the output sequence, using a pointer mechanism to select elements from the memory.

The Process block is the clever part. It is an LSTM that takes no inputs and produces no outputs. It just thinks. Each of its T processing steps computes:

q_t = LSTM(q*_{t-1})     (update query state)
e_{i,t} = f(m_i, q_t)     (score each memory)
a_{i,t} = exp(e_{i,t}) / ∑_j exp(e_{j,t})     (attention weights)
r_t = ∑_i a_{i,t} m_i     (read from memory)
q*_t = [q_t ; r_t]     (concatenate query + readout)

After T steps, the final state q*_T is a rich, order-invariant summary of the entire input set. The number of processing steps T (listed as P in the Chapter 5 results table) is a hyperparameter: more steps let the model perform more complex reasoning about relationships within the set.
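The loop can be sketched in plain Python. This is not the paper's implementation: the LSTM is replaced by a single fixed random matrix followed by tanh (an assumption made to keep the example short), the scoring function is a dot product, and `process_block` is a hypothetical name. It still follows the five equations step by step.

```python
import math
import random
from itertools import permutations

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def process_block(memories, steps, seed=0):
    """Run `steps` rounds of attention over `memories`; returns q*_T (length 2*d)."""
    d = len(memories[0])
    rng = random.Random(seed)
    # Stand-in for the LSTM: one fixed random matrix W (d x 2d) and a tanh.
    W = [[rng.uniform(-1, 1) for _ in range(2 * d)] for _ in range(d)]
    q_star = [0.0] * (2 * d)                       # q*_0 carries no information yet
    for _ in range(steps):
        q = [math.tanh(sum(w * v for w, v in zip(row, q_star))) for row in W]
        scores = [sum(a * b for a, b in zip(m, q)) for m in memories]
        attn = softmax(scores)
        r = [sum(a_i * m[k] for a_i, m in zip(attn, memories)) for k in range(d)]
        q_star = q + r                             # concatenate query and readout
    return q_star

# Illustrative memory vectors.
memories = [[0.3, 1.0], [0.1, -0.5], [0.9, 0.2]]
reference = process_block(memories, steps=3)
gap = max(abs(a - b)
          for perm in permutations(memories)
          for a, b in zip(process_block(list(perm), steps=3), reference))
```

Because the state is only ever updated from attention readouts, `gap` stays at floating-point noise for every permutation of the memory. That is the order-invariance argument in executable form.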

Why "no input" to the LSTM? The process LSTM does not read any external sequence. Its only source of information is the attention readout from the memory bank. This is crucial: if it read the inputs sequentially, it would depend on their order. By reading only through attention (which is order-invariant), the entire computation is order-invariant.

The Write block uses a pointer network: instead of outputting tokens from a fixed vocabulary, it points at specific input elements. For sorting, this means it points at the input numbers in sorted order. The write block can also use an additional attention step (called a glimpse) before each pointer output, which the paper found significantly improves performance.

You can think of the entire architecture as a special case of a Neural Turing Machine or Memory Network, but specifically designed to guarantee permutation invariance of the input encoding.

The Process block in Read-Process-Write is an LSTM with "no inputs." Where does it get information about the input set?

(a) From the initial hidden state
(b) From attention readouts over the memory vectors at each processing step
(c) It doesn't; it generates information from scratch

Chapter 5: Sorting Numbers

The paper tests the Read-Process-Write architecture on a clean synthetic task: sorting N random numbers between 0 and 1. This is a pure set-to-sequence problem. The input is an unordered set. The output is a specific sequence (the sorted order).

They compare two approaches:

Component	Ptr-Net (baseline)	Read-Process-Write
Encoder	LSTM reads numbers sequentially	Embed each number, then Process block with attention
Decoder	Pointer network	Pointer network (same)
Order invariant?	No	Yes

The results tell a clear story:

N	Ptr-Net	P=0 steps	P=1 step	P=5 steps	P=10 steps
5	90%	84%	92%	94%	94%
10	28%	30%	44%	57%	50%
15	4%	2%	5%	4%	10%

(All results with glimpses enabled. Accuracy = fraction of sequences sorted perfectly.)
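The metric in the caption is strict: a sequence counts only if every position is right. A minimal sketch of that scoring rule, with made-up predictions (`exact_match_accuracy` is a hypothetical helper, not code from the paper):

```python
def exact_match_accuracy(predictions, inputs):
    """Fraction of examples whose predicted sequence equals sorted(input) exactly."""
    hits = sum(1 for pred, xs in zip(predictions, inputs)
               if list(pred) == sorted(xs))
    return hits / len(inputs)

inputs = [[0.3, 0.1, 0.2], [0.9, 0.5, 0.7]]
predictions = [[0.1, 0.2, 0.3],      # fully sorted: counts as correct
               [0.5, 0.9, 0.7]]      # one adjacent swap: counts as wrong
accuracy = exact_match_accuracy(predictions, inputs)
```

Here `accuracy` is 0.5. Under this rule a single misplaced element zeroes out the whole example, which is one reason the numbers fall so quickly as N grows.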

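Three observations: (1) With zero processing steps, the model trails the baseline at N=5 (84% vs 90%) and N=15 (2% vs 4%) and is only marginally ahead at N=10 (30% vs 28%): without a Process block, the write block starts with no summary of the set and must rely on its own attention over the memory. (2) With even one processing step, it surpasses the baseline at every N. (3) More processing steps generally help, especially for harder problems, although the trend is not monotonic (57% at P=5 vs 50% at P=10 for N=10). The Process block is doing genuine computation over the set.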

Notice how rapidly accuracy drops with N. Sorting 5 numbers is nearly solved (94%). Sorting 15 is still very hard (10%). This is a combinatorial problem: the number of possible orderings grows as N!, and the model must find the single correct one. Still, with at least one processing step, the Read-Process-Write architecture matches or beats the sequential baseline at every N on this set-to-sequence task.

The glimpse mechanism also matters enormously. Without glimpses, the best N=10 accuracy is only 19%. With glimpses, it reaches 57%. The glimpse is an extra attention step that lets the Write block look at the memory between each pointer output, providing fine-grained, context-dependent information.

In the sorting experiment, what happens when the Read-Process-Write model uses zero processing steps?

(a) It fails to beat the sequential baseline consistently, because the write block starts without a processed summary of the set
(b) It still outperforms the baseline at every N due to the read block
(c) It achieves the same accuracy as with many processing steps

Chapter 6: Output Order Matters

We have handled input sets. Now let us flip the problem. What happens when the output is a set?

The chain rule forces you to produce outputs one at a time, in some order. But if the output is a set, every ordering is equally valid. The output set {1, 3, 5} could be produced as 1→3→5 or 5→3→1 or 3→1→5, and all three represent the same answer. Which factorization should the model learn?

The paper demonstrates the impact with two experiments:

Language modeling. They train LSTMs on Penn Treebank text in three orderings:

Order	Example	Perplexity
Natural	"This is a sentence ."	86
Reversed	". sentence a is This"	86
3-word scramble	"a is This <pad> . sentence"	96

Natural and reversed perform identically: both preserve local word dependencies (just mirrored). But the 3-word scramble, which reverses each consecutive group of three words and so breaks n-gram structure across group boundaries, costs 10 perplexity points. The model's capacity is wasted trying to learn scrambled conditionals.
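The three orderings in the table are simple token transformations. The sketch below reproduces them; the padding rule (pad to a multiple of three, then reverse each group) is inferred from the table's example row, and the function name is hypothetical.

```python
def three_word_reversal(tokens, group=3, pad="<pad>"):
    """Pad to a multiple of `group`, then reverse the tokens inside each group."""
    padded = list(tokens) + [pad] * (-len(tokens) % group)
    out = []
    for start in range(0, len(padded), group):
        out.extend(reversed(padded[start:start + group]))
    return out

sentence = "This is a sentence .".split()
natural = " ".join(sentence)
reverse = " ".join(reversed(sentence))
scramble = " ".join(three_word_reversal(sentence))
```

`scramble` comes out as "a is This <pad> . sentence", matching the table. Full reversal keeps every pair of neighbors adjacent, while the grouped reversal puts "This" next to "<pad>" and separates "a" from "sentence".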

Constituency parsing. A parse tree can be linearized in two ways: depth-first traversal or breadth-first traversal. Same tree, same information, different orderings. Depth-first achieves 89.5% F1. Breadth-first drops to 81.5% — an 8-point gap from ordering alone.

Why does depth-first win? Depth-first traversal keeps related tree nodes close together in the output sequence. A parent is immediately followed by its children. The LSTM can model these local dependencies easily. Breadth-first scatters related nodes far apart, forcing the LSTM to maintain long-range dependencies across entire tree levels. Same information, but one ordering aligns with the LSTM's inductive bias and the other fights against it.
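To see the locality difference, here is a sketch that linearizes a small made-up tree both ways and measures how far each parent sits from its first child. The tree, its labels, and the gap measure are illustrative assumptions; the paper's actual linearization also includes bracket tokens.

```python
from collections import deque

# Hypothetical toy parse tree: (label, [children]); labels are kept unique.
tree = ("S", [("NP1", [("DT", []), ("NN", [])]),
              ("VP", [("VBD", []), ("PP", [("IN", []), ("NP2", [])])])])

def depth_first(node):
    label, children = node
    out = [label]
    for child in children:
        out.extend(depth_first(child))
    return out

def breadth_first(node):
    out, queue = [], deque([node])
    while queue:
        label, children = queue.popleft()
        out.append(label)
        queue.extend(children)
    return out

def first_child_gaps(sequence, node):
    """Sorted distances in `sequence` from each parent to its first child."""
    gaps, stack = [], [node]
    while stack:
        label, children = stack.pop()
        if children:
            gaps.append(sequence.index(children[0][0]) - sequence.index(label))
        stack.extend(children)
    return sorted(gaps)

dfs = depth_first(tree)
bfs = breadth_first(tree)
dfs_gaps = first_child_gaps(dfs, tree)
bfs_gaps = first_child_gaps(bfs, tree)
```

In the depth-first sequence every parent is followed immediately by its first child (all gaps are 1). In the breadth-first sequence the gaps grow to 2 and 3 even on this nine-node tree, and they widen further as trees get larger.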

For combinatorial problems like sorting, the situation is even more dramatic. If you treat the output indices as a set and train with random orderings, the model must place equal probability on all N! valid orderings for every input. This is catastrophically inefficient: for N=5, there are 120 valid outputs for each input. The model's probability mass is spread thin, and convergence is painfully slow or impossible.

Depth-first traversal of a parse tree achieves 89.5% F1, while breadth-first achieves 81.5%. Why does output ordering cause such a large gap?

(a) Because breadth-first trees contain less information
(b) Because LSTMs cannot produce breadth-first sequences
(c) Because depth-first keeps related nodes close together, matching the LSTM's strength at modeling local dependencies

Chapter 7: Searching Over Orders

If the output order matters but we do not know the best one, can we let the model find it?

The paper proposes a surprisingly simple idea. Instead of maximizing log-probability under a fixed ordering, maximize over all possible orderings:

θ* = arg max_θ ∑_i max_π log p(Y_π | X_i; θ)

For each training example, find the ordering π that gives the highest probability under the current model, and train on that ordering. The model simultaneously learns the parameters and discovers the best order.

But there are N! possible orderings. You cannot try them all. The paper addresses this with two tricks:

Trick 1: Pre-train uniformly

For the first 1000 steps, train on a uniform mixture of all orderings. This prevents the model from locking onto a random ordering determined by initial weights.

↓ then

Trick 2: Sample, don't search

Instead of searching over all N! orderings, sample an ordering proportional to p(Y_π|X). This costs O(1) model evaluations via ancestral sampling, vs O(N!) for exhaustive search.
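The difference between exhaustive search and sampling can be sketched with a toy autoregressive model over orderings. Everything here is an illustrative assumption: the hand-written `score` function stands in for a trained network, and the helper names are hypothetical. Because each step is a softmax over the items not yet emitted, drawing one item at a time samples a full ordering with probability exactly p(Y_π|X).

```python
import math
import random
from itertools import permutations

def step_distribution(prev, remaining, score):
    """Toy conditional p(next | prev) over the items not yet emitted."""
    logits = [score(prev, item) for item in remaining]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def log_prob(ordering, score):
    """log p(Y_pi | X) as a sum of per-step log conditionals."""
    total, prev, remaining = 0.0, None, list(ordering)
    for item in ordering:
        probs = step_distribution(prev, remaining, score)
        total += math.log(probs[remaining.index(item)])
        remaining.remove(item)
        prev = item
    return total

def best_ordering(items, score):
    """Exhaustive max over all N! orderings; only feasible for tiny N."""
    return max(permutations(items), key=lambda p: log_prob(p, score))

def sample_ordering(items, score, rng):
    """Ancestral sampling: one pass through the model, no enumeration."""
    prev, remaining, out = None, list(items), []
    while remaining:
        probs = step_distribution(prev, remaining, score)
        item = rng.choices(remaining, weights=probs, k=1)[0]
        out.append(item)
        remaining.remove(item)
        prev = item
    return tuple(out)

def score(prev, item):
    """Hypothetical score that prefers position k+1 right after position k."""
    if prev is None:
        return 2.0 if item == 1 else 0.0
    return 2.0 if item == prev + 1 else 0.0

items = (1, 2, 3, 4, 5)
best = best_ordering(items, score)
total_mass = sum(math.exp(log_prob(p, score)) for p in permutations(items))
sampled = sample_ordering(items, score, random.Random(0))
```

`best_ordering` evaluates all 120 orderings, while `sample_ordering` makes a single pass of N steps. In a training loop, the sampled ordering would serve as the target for that example in place of the exact argmax.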

They test this on 5-gram language modeling. Each 5-gram "This is a five gram" is converted to a set of (word, position) tuples: {(This,1), (is,2), (a,3), (five,4), (gram,5)}. The model must produce these tuples, but can choose any ordering.
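The conversion is easy to write down. In this sketch (helper names are hypothetical), each word is tagged with its 1-based position, so the sentence can be recovered no matter what order the tuples are emitted in.

```python
def to_position_set(ngram):
    """Turn an n-gram string into a set of (word, position) tuples, positions from 1."""
    return {(word, pos) for pos, word in enumerate(ngram.split(), start=1)}

def restore(tuples):
    """Rebuild the n-gram from tuples emitted in any order."""
    return " ".join(word for word, _ in sorted(tuples, key=lambda t: t[1]))

five_gram = "This is a five gram"
as_set = to_position_set(five_gram)
# One possible emission order, using the scrambled positions (5,1,3,4,2).
emitted = [("gram", 5), ("This", 1), ("a", 3), ("five", 4), ("is", 2)]
recovered = restore(emitted)
```

Since the positions travel with the words, every one of the 5! emission orders encodes the same 5-gram. The only thing the ordering changes is how hard each conditional is to predict.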

Setup	Orderings considered	Perplexity
Natural order (1,2,3,4,5)	1	225
Scrambled (5,1,3,4,2)	1	280
Easy search (2 options)	2	225
Full search (5! options)	120	225

The result: When searching over all 120 orderings, the model converges to the natural order (1,2,3,4,5) or its reverse (5,4,3,2,1), achieving the same optimal perplexity as if we had known the best order in advance. The model discovers that natural word order is optimal without being told.

This is remarkable. The model is given a set with no ordering information. Through training dynamics alone, it discovers that left-to-right (or right-to-left) English word order minimizes perplexity. The chain rule's factorization order is not just a modeling choice — the model can learn the best one.

[Interactive demo: Order Search: Finding the Best Factorization. Each bar is one possible ordering of a set, with height showing its log-probability under the current model. Training starts from a uniform prior at step 0, and the model converges to prefer one ordering over all others.]

When searching over all 5! orderings for 5-gram modeling, which ordering does the model converge to?

(a) A random ordering that depends on initialization
(b) The natural word order (1,2,3,4,5) or its reverse, matching the optimal perplexity of 225
(c) An ordering where the most common word comes first

Chapter 8: Language & Parsing Results

The paper validates its ideas on standard benchmarks, not just synthetic tasks.

Language modeling on Penn Treebank. The key finding is not about achieving state-of-the-art perplexity, but about demonstrating the ordering effect at scale. Natural order and reversed order match at 86 perplexity. The 3-word scramble degrades to 96. Even the mighty LSTM cannot fully compensate for a bad ordering — and training perplexity is also 10 points higher, confirming that the model struggles to learn the scrambled conditionals, not just to generalize them.

Constituency parsing. The depth-first vs breadth-first comparison (89.5% vs 81.5% F1) shows that output ordering is not a minor detail. It is an architectural decision with consequences comparable to changing the model size or training data.

Graphical model estimation. The paper generates star-shaped graphical models: one "head" variable connected to several "leaf" variables. The leaves are conditionally independent given the head. They train LSTMs to model the joint probability in two orderings: head-first vs head-last.

Head-first wins. When the head variable is produced first, each subsequent leaf is conditioned on it — matching the true causal structure. When the head is last, the model must implicitly infer the head from the leaves before it can use it. With enough data, both orderings work. But with limited data (the realistic case), the ordering that matches the causal structure converges much faster and more reliably.
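A tiny star model shows why the head-first conditionals are simpler. The probabilities below are made up for illustration: one binary head and two binary leaves that each depend only on the head.

```python
from itertools import product

# Hypothetical star model: binary head h, two binary leaves that depend only on h.
p_head = {0: 0.5, 1: 0.5}
p_leaf_given_head = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # [h][leaf value]

joint = {(h, l1, l2): p_head[h] * p_leaf_given_head[h][l1] * p_leaf_given_head[h][l2]
         for h, l1, l2 in product((0, 1), repeat=3)}

def prob(**fixed):
    """Marginal probability of the named variables taking the given values."""
    names = ("h", "l1", "l2")
    return sum(p for outcome, p in joint.items()
               if all(outcome[names.index(k)] == v for k, v in fixed.items()))

# Head-first: P(l2=1 | h=1, l1) is the same for both values of l1.
head_first = [prob(h=1, l1=l1, l2=1) / prob(h=1, l1=l1) for l1 in (0, 1)]
# Head-last: P(l2=1 | l1) has the head marginalized out and depends on l1.
head_last = [prob(l1=l1, l2=1) / prob(l1=l1) for l1 in (0, 1)]
```

With the head known, the second leaf's conditional is 0.8 whatever the first leaf was. Without it, the conditional swings from about 0.23 to about 0.72 depending on the first leaf, because the model has to infer the hidden head from the leaves it has already seen. With many leaves, that inference must pool evidence from all of them.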

This is perhaps the deepest insight of the paper. The best ordering for the chain rule is the one that aligns with the causal structure of the data. When causes come before effects in the factorization, each conditional is simple. When effects come before causes, the conditionals become complex marginalizations that are hard for a finite-capacity model to learn.

In the graphical model experiment, why does producing the "head" variable first lead to better results?

(a) Because it aligns the chain rule factorization with the true causal structure: each leaf is simply conditioned on the head
(b) Because the head variable is always easier to predict
(c) Because LSTMs can only model tree-shaped distributions

Chapter 9: Connections

Pointer Networks (Vinyals et al., 2015). The Write block of Read-Process-Write uses pointer networks as its output mechanism. Instead of selecting from a fixed vocabulary, it points at input elements. This is essential for combinatorial problems where the output is a permutation of the input.

Neural Turing Machines (Graves et al., 2014) and Memory Networks (Weston et al., 2015). The Read-Process-Write architecture can be viewed as a special case of these external-memory architectures. The memory bank stores input embeddings, and the Process block reads from it using content-based addressing. The key specialization is the focus on permutation invariance.

Set functions and Deep Sets. This paper is an early step toward the later "Deep Sets" framework (Zaheer et al., 2017), which formalized permutation-invariant functions. The insight that set encodings must be order-invariant was foundational. Deep Sets showed that, under suitable conditions, a permutation-invariant function can be decomposed as ρ(∑ φ(x_i)); the Read-Process-Write model suggests that multiple rounds of attention can be more effective in practice than a single sum.

Transformers (Vaswani et al., 2017). The Process block, which runs multiple rounds of content-based attention over a set of memory vectors, anticipates a core idea of Transformers. In a Transformer, every layer performs attention over all positions. Positional encodings are added to inject order; without them, self-attention treats its input as a set (permuting the inputs simply permutes the outputs). The connection is direct: both architectures process sets through iterative attention.
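The role of positional encodings can be checked on a stripped-down self-attention layer. This sketch assumes identity projections (no learned weights) and a deliberately crude positional signal that appends the index as an extra feature; neither is how a real Transformer is parameterized.

```python
import math

def self_attention(xs):
    """Single-head dot-product self-attention with identity projections."""
    outputs = []
    for query in xs:
        scores = [sum(a * b for a, b in zip(query, key)) for key in xs]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        z = sum(exps)
        outputs.append([sum(e / z * key[k] for e, key in zip(exps, xs))
                        for k in range(len(query))])
    return outputs

def add_positions(xs):
    """Hypothetical positional signal: append the index as an extra feature."""
    return [list(x) + [float(i)] for i, x in enumerate(xs)]

def max_gap(shuffled_out, original_out, perm, dims=2):
    """Largest difference after undoing the permutation, over the first `dims` features."""
    return max(abs(shuffled_out[j][k] - original_out[perm[j]][k])
               for j in range(len(perm)) for k in range(dims))

tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

gap_without_positions = max_gap(self_attention(shuffled), self_attention(tokens), perm)
gap_with_positions = max_gap(self_attention(add_positions(shuffled)),
                             self_attention(add_positions(tokens)), perm)
```

Without positions, shuffling the tokens just shuffles the outputs (`gap_without_positions` is floating-point noise). Once each token carries its index, the same token gets a different output depending on where it sits, so `gap_with_positions` is clearly nonzero.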

This Paper (2016)

Attention over sets, order-invariant encoding, order search for outputs

↓ influenced

Deep Sets (2017)

Formal theory of permutation-invariant neural networks

↓ parallels

Transformers (2017)

Multi-head self-attention over positions — inherently a set operation + positional encoding

The lasting contribution. This paper established two ideas that became foundational: (1) the representation of inputs matters as much as the model architecture, and (2) the ordering of outputs in autoregressive models is a design choice with major consequences. These insights echo through modern work on set prediction (DETR for object detection), non-autoregressive generation, and the design of Transformer architectures that treat input tokens as sets with learned or sinusoidal position encodings.

Paper impact. Published at ICLR 2016, this paper influenced the development of set-based neural architectures, non-autoregressive decoding strategies, and the understanding of inductive biases in sequence models. Every time you add positional encodings to a Transformer, you are acknowledging the insight from this paper: without explicit order information, attention treats its inputs as a set.


How does the Read-Process-Write architecture relate to Transformers?

(a) They are unrelated architectures
(b) Both process sets through iterative attention; Transformers add positional encodings to inject order, while Read-Process-Write is inherently order-invariant
(c) Transformers replaced Read-Process-Write entirely, making it obsolete
