What if a neural network could read from and write to an external memory, like a computer uses RAM? This paper made memory differentiable — and taught networks to learn algorithms.
Imagine you need to copy a sequence of numbers. Someone reads you ten digits: 3, 7, 1, 4, 9, 2, 6, 8, 0, 5. Now repeat them back.
Easy, right? You store them in short-term memory, then read them out. You do not try to compress the digits into some internal feeling about what they mean. You simply remember the raw data and recall it. Your brain has a dedicated workspace for this — psychologists call it working memory. It holds a small amount of information (famously, "seven plus or minus two" items) and manipulates it according to rules.
Now try doing this with an LSTM.
An LSTM has a fixed-size hidden state — say, 256 numbers. It must squeeze everything it has ever seen into that vector. For short sequences, it works. But ask it to copy 100 random vectors, and it crumbles. The hidden state cannot hold that much raw data. The information gets compressed, mixed, and eventually lost.
This is less a failure of training or optimization than an architectural constraint. You are trying to store 100 × 8 = 800 bits of arbitrary data in a 256-dimensional continuous vector that is also busy doing computation. In principle a real-valued vector could encode that many bits, but with finite precision, noise, and gradient-based training the network cannot pack and unpack them reliably. In practice, the hidden state does not have enough usable capacity to faithfully represent all that data.
This is not just a limitation of LSTMs. It is a limitation of all standard neural network architectures. Feedforward networks have no memory at all — they process each input independently. RNNs have memory (the hidden state), but it is entangled with computation. The hidden state must simultaneously serve as both "what I remember" and "what I am currently computing." As tasks become more complex, these two roles conflict.
Computers solved this problem decades ago. A CPU has a tiny set of registers (like an LSTM's hidden state), but it also has RAM — a large, addressable memory bank. The CPU writes data to specific addresses in RAM and later reads it back. The registers handle computation; the RAM handles storage. Separating storage from processing is one of the most powerful principles in computer architecture — it is what makes computers able to handle arbitrary amounts of data with a fixed-size processor.
Graves, Wayne, and Danihelka asked: can we give a neural network its own RAM? And more importantly, can we make the whole thing differentiable, so the network learns how to use its memory through gradient descent?
The answer was the Neural Turing Machine (NTM): a neural network coupled to an external memory matrix, interacting with it through learned attention mechanisms. The name is a deliberate homage to Alan Turing's theoretical machine — a finite controller with access to an infinite tape. The NTM is the same idea, made differentiable.
The challenge was not conceptual — the idea of giving neural networks external memory is obvious. The challenge was engineering: how do you make memory access differentiable? A conventional memory lookup says "give me the data at address 7." That is a discrete operation with no gradient. You cannot backpropagate through "pick address 7" because there is no smooth function mapping a continuous input to a discrete address choice.
The NTM's breakthrough was replacing hard address selection with soft attention: instead of picking one address, attend to all addresses simultaneously with different weights. The gradient flows through the weights, and the network learns which addresses to attend to. This idea — making a discrete selection continuous by replacing it with a weighted average — would become one of the most influential techniques in deep learning, underpinning attention mechanisms, mixture-of-experts routing, and differentiable programming more broadly.
The NTM has two components: a controller and a memory matrix.
The controller is a neural network — either a feedforward network or an LSTM. It receives input from the outside world, produces output, and most importantly, it interacts with the memory matrix through read heads and write heads.
The memory is an N × M matrix, written $M_t$ at time step t (the paper uses the letter M both for the matrix and for its width). Think of it as N rows, each storing an M-dimensional vector. N is the number of memory locations (addresses), and M is the width of each memory slot. In the paper's experiments, N = 128 locations, each holding a 20-dimensional vector. That is 2,560 floating-point numbers of working storage: not a lot by modern standards, but enough to store and manipulate sequences of over 100 vectors.
The key distinction from a standard RNN: the controller and memory are separate. The controller's weights are fixed after training (they encode the "program"). The memory contents change at every time step (they hold the "data"). This pairing of a fixed processor with a rewritable, addressable store echoes the architecture that von Neumann described in 1945, now implemented with differentiable components.
At each time step, the controller does three things:
1. Read. Each read head produces a weighting over memory locations — a vector of N non-negative values that sum to 1. This weighting acts like a soft attention distribution. The read head returns a weighted sum of all memory rows. If the weighting is sharp (concentrated on one location), you get that location's content. If it is diffuse, you get a blend.
2. Write. Each write head also produces a weighting, plus two vectors: an erase vector (which elements to clear) and an add vector (what new data to store). This two-step erase-then-add process is borrowed directly from LSTM's forget-then-input gating mechanism, generalized to an external memory bank.
3. Output. The controller produces an output for the external world, informed by what it just read from memory. The read vector from the previous step is typically concatenated with the current input and fed into the controller, so the controller always knows what it last retrieved from memory.
The controller can be either feedforward or recurrent (LSTM). A feedforward controller has no internal memory — it relies entirely on the external memory matrix. An LSTM controller has its own hidden state in addition to the memory matrix, like having both CPU registers and RAM. The paper found that both work, but LSTM controllers are more robust because they can buffer information internally across time steps.
There is an important subtlety about feedforward controllers. With a single read head, a feedforward controller can only perform a unary operation on one memory vector per step. If you need to compare two memory locations (say, for sorting), you need two read heads. An LSTM controller can work around this by caching previous reads in its hidden state, but a feedforward controller cannot. This is why the priority sort task required 8 heads with a feedforward controller but only 5 with LSTM.
Reading is beautifully simple. You have a memory matrix $M_t$ with N rows, and a weighting vector $w_t$ with N elements that sum to 1. The read operation returns a weighted combination of all memory rows:

$$r_t = \sum_{i} w_t(i)\, M_t(i)$$

Where $M_t(i)$ is the i-th row of memory (an M-dimensional vector), and $w_t(i)$ is the attention weight on location i. The constraints on w are simple: every element must be between 0 and 1, and they must sum to 1. In other words, w is a probability distribution over memory locations. This makes the read operation a convex combination: the result always lies "within" the cloud of memory vectors, never outside it.
Let's work through a concrete example. Suppose we have 4 memory locations, each storing a 3-dimensional vector:
| Location | Content | Weight w(i) |
|---|---|---|
| 0 | [1.0, 0.0, 0.5] | 0.1 |
| 1 | [0.0, 1.0, 0.0] | 0.7 |
| 2 | [0.5, 0.5, 1.0] | 0.1 |
| 3 | [0.0, 0.0, 0.0] | 0.1 |
The read vector is:
r = 0.1×[1,0,0.5] + 0.7×[0,1,0] + 0.1×[0.5,0.5,1] + 0.1×[0,0,0]
= [0.1, 0, 0.05] + [0, 0.7, 0] + [0.05, 0.05, 0.1] + [0, 0, 0]
= [0.15, 0.75, 0.15]
Because w(1) = 0.7 dominates, the result is close to location 1's content [0, 1, 0]. The read head is mostly "looking at" location 1, with a bit of blurring from other locations.
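The worked example above can be checked in a few lines of NumPy. This is a minimal sketch: the function name `read` is chosen here for illustration and does not come from the paper.

```python
import numpy as np

def read(memory: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Return the weighted sum of memory rows: r = sum_i w[i] * memory[i]."""
    return w @ memory

# The 4 x 3 memory and the weighting from the table above.
memory = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 1.0],
    [0.0, 0.0, 0.0],
])
w = np.array([0.1, 0.7, 0.1, 0.1])

r = read(memory, w)
print(r)  # approximately [0.15, 0.75, 0.15]
```

Because the read is a single matrix-vector product, its gradient with respect to both the weighting and the memory contents is well defined, which is what lets the controller learn where to look.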
Notice that this is exactly the same operation as attention in sequence-to-sequence models. In Bahdanau attention (2014), the decoder computes a weighted sum of encoder hidden states. In the NTM, the controller computes a weighted sum of memory rows. The mathematical form is identical — the NTM just applies it to an external memory bank instead of another neural network's hidden states. This is not a coincidence: the NTM and Bahdanau attention were developed around the same time, and both represent the same fundamental insight about differentiable information retrieval.
Writing is more complex than reading. Inspired by LSTM's input and forget gates, the paper decomposes each write into two steps: erase, then add.
Step 1: Erase. The write head produces an erase vector $e_t$ with M elements, each in the range (0, 1). The memory is erased pointwise:

$$\tilde{M}_t(i) = M_{t-1}(i)\,\big[\mathbf{1} - w_t(i)\, e_t\big]$$

If both the weight $w_t(i)$ and the erase element $e_t(j)$ are 1, that element of that memory location is set to zero. If either is 0, the memory is unchanged. This gives fine-grained control: you can erase specific elements of specific locations.
Step 2: Add. The write head produces an add vector $a_t$ with M elements. After erasing, the new data is written:

$$M_t(i) = \tilde{M}_t(i) + w_t(i)\, a_t$$

The weight $w_t(i)$ controls where the data goes. The add vector $a_t$ controls what data is written. Locations with high weight receive the full add vector; locations with low weight receive almost nothing.
There is an elegant symmetry with LSTM gates. The LSTM's forget gate decides what to erase from the cell state (multiplication by a value in [0, 1]). The input gate decides what to add. The NTM's erase and add vectors play exactly the same roles, but applied to an external memory matrix rather than an internal cell state. The NTM generalizes the LSTM's gated memory to an addressable, multi-location memory bank.
Let's trace a concrete write. Suppose location 2 currently holds [0.5, 0.5, 1.0], the write weight on location 2 is 0.9, the erase vector is [1, 1, 0], and the add vector is [0.3, 0.8, 0.0].
Erase: M̃(2) = [0.5, 0.5, 1.0] × [1 − 0.9 × [1, 1, 0]] = [0.5, 0.5, 1.0] × [0.1, 0.1, 1.0] = [0.05, 0.05, 1.0]
Add: M(2) = [0.05, 0.05, 1.0] + 0.9 × [0.3, 0.8, 0.0] = [0.05, 0.05, 1.0] + [0.27, 0.72, 0.0] = [0.32, 0.77, 1.0]
The first two elements were mostly erased and replaced; the third element (erase = 0) was left untouched. This is how a write head can selectively update only the parts of memory it wants to change.
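The traced write can be reproduced with a short NumPy sketch. The function name `write` is illustrative, and placing the remaining 0.1 of write weight on location 3 is an assumption made here so that the weighting sums to 1; the example above only fixes the weight on location 2.

```python
import numpy as np

def write(memory, w, erase, add):
    """Erase-then-add write. Returns a new memory matrix.

    memory: (N, M) array, w: (N,) weighting, erase/add: (M,) vectors.
    """
    erased = memory * (1.0 - np.outer(w, erase))  # pointwise erase
    return erased + np.outer(w, add)              # weighted add

memory = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 1.0],
    [0.0, 0.0, 0.0],
])
w = np.array([0.0, 0.0, 0.9, 0.1])   # 0.1 on location 3 is an assumption
erase = np.array([1.0, 1.0, 0.0])
add = np.array([0.3, 0.8, 0.0])

new_memory = write(memory, w, erase, add)
print(new_memory[2])  # approximately [0.32, 0.77, 1.0]
```

Locations 0 and 1 carry zero write weight, so their rows come back unchanged.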
An important property: when multiple write heads are present, the erasures can be performed in any order because multiplication is commutative. Similarly, the adds are order-independent. This means the system is well-defined regardless of how we schedule multiple heads — a valuable property for implementation and for theoretical analysis of the system's behavior.
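A quick numerical check of this order-independence, as a sketch: it assumes all erases are applied before all adds, and the two random heads and the helper names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 3
memory = rng.random((N, M))

def random_head():
    """Hypothetical head: a normalized weighting, an erase vector, an add vector."""
    w = rng.random(N)
    w /= w.sum()
    return w, rng.random(M), rng.random(M)

heads = [random_head(), random_head()]

def write_all(memory, heads):
    """Apply every head's erase, then every head's add."""
    out = memory.copy()
    for w, e, _ in heads:
        out = out * (1.0 - np.outer(w, e))
    for w, _, a in heads:
        out = out + np.outer(w, a)
    return out

forward = write_all(memory, heads)
backward = write_all(memory, heads[::-1])
print(np.allclose(forward, backward))  # True
```

The erase factors multiply and the add terms sum, so swapping the heads changes nothing beyond floating-point rounding.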
We know that read and write operations use a weighting vector over memory locations. But how is that weighting produced? The NTM uses two complementary addressing mechanisms. The first is content-based addressing.
The idea: find memory locations whose contents are similar to a query. This is exactly how you recall a phone number — you think of the person's name, and your brain retrieves the associated number. You address by content, not by position. This is also how Hopfield networks (1982) work: store patterns in a weight matrix, then retrieve them by providing a partial or noisy version of the pattern. The NTM makes the same principle differentiable and integrates it into an end-to-end trainable system.
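Each head produces a key vector $k_t$ (length M) and a key strength $\beta_t$ (a positive scalar). The content-based weighting is computed by comparing the key to every row in memory using cosine similarity, then applying a softmax sharpened by β:

$$w^c_t(i) = \frac{\exp\big(\beta_t\, K[k_t, M_t(i)]\big)}{\sum_{j} \exp\big(\beta_t\, K[k_t, M_t(j)]\big)}$$

Where K is cosine similarity:

$$K[u, v] = \frac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert}$$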
This is just a softmax over similarity scores, weighted by β. Let's unpack what β does.
When β is small (close to 0), the softmax output is nearly uniform — the head attends to all locations roughly equally, regardless of similarity. When β is large (say, 100), the softmax becomes very sharp, concentrating almost all weight on the single most similar location. The controller learns to adjust β to control how precise its memory lookups should be.
Content-based addressing is powerful for associative recall: you store a key-value pair, then later retrieve the value by providing a similar key. The paper tested this with an associative recall task, and the NTM learned to do it perfectly. This is directly analogous to a hash table lookup, except it is fuzzy: the query does not need to exactly match the key, just be similar enough for cosine similarity to identify the right location.
You might wonder: why cosine similarity rather than, say, Euclidean distance? Cosine similarity measures the angle between vectors, ignoring their magnitude. This makes it invariant to scaling — a memory vector that is "loud" (large magnitude) and one that is "quiet" (small magnitude) can still match perfectly if they point in the same direction. This is important because the controller and memory may operate at different scales.
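Content-based addressing can be sketched in NumPy as follows. The function name and the small `eps` guard against zero-norm rows are choices made here, not details from the paper. The key is deliberately a scaled copy of row 1 to show that magnitude does not affect the match.

```python
import numpy as np

def content_weighting(memory, key, beta, eps=1e-8):
    """Softmax over beta-scaled cosine similarities between key and each row."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key)
    sims = (memory @ key) / (norms + eps)   # cosine similarity per row
    scores = beta * sims
    scores = scores - scores.max()          # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

memory = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 1.0],
    [0.0, 0.0, 0.0],
])
key = np.array([0.0, 2.0, 0.0])  # same direction as row 1, twice the length

soft = content_weighting(memory, key, beta=0.1)    # nearly uniform
sharp = content_weighting(memory, key, beta=50.0)  # almost one-hot on row 1
print(soft)
print(sharp)
```

With a small β the four weights stay close to 0.25 each; with a large β nearly all the weight lands on location 1.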
But content-based addressing has a limitation. What if the content of a memory location is arbitrary? In a copy task, the data is random binary vectors — there is no meaningful content to search for. You just need to write to "the next empty slot" and later read from "slot 1, then slot 2, then slot 3." You need location-based addressing.
Location-based addressing lets the head move through memory like a tape head on a Turing machine: step forward, step backward, or stay in place. It uses three mechanisms: interpolation, shift, and sharpening.
Step 1: Interpolation gate $g_t$. The controller emits a scalar $g_t \in (0, 1)$ that blends between the content-based weighting and the previous time step's weighting:

$$w^g_t = g_t\, w^c_t + (1 - g_t)\, w_{t-1}$$
If g = 1, the head uses pure content addressing (fresh lookup). If g = 0, the head ignores content entirely and reuses last step's weighting. This is critical for iteration: once you have found the starting location, set g = 0 and just shift from where you were.
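The gate is a one-line blend. A small sketch with made-up weightings (the function name is illustrative):

```python
import numpy as np

def interpolate(w_content, w_prev, g):
    """Blend the content weighting with the previous weighting using gate g."""
    return g * w_content + (1.0 - g) * w_prev

w_content = np.array([0.0, 0.0, 0.0, 1.0])  # fresh lookup points at location 3
w_prev = np.array([0.0, 1.0, 0.0, 0.0])     # head was at location 1 last step

fresh = interpolate(w_content, w_prev, g=1.0)   # pure content addressing
reuse = interpolate(w_content, w_prev, g=0.0)   # ignore content, keep position
mixed = interpolate(w_content, w_prev, g=0.5)   # half and half
print(fresh, reuse, mixed)
```

Because both inputs sum to 1 and g lies between 0 and 1, the blended weighting also sums to 1.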
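Step 2: Convolutional shift $s_t$. The controller emits a shift distribution $s_t$ over allowed shifts (typically {-1, 0, +1}). The gated weighting is rotated via circular convolution:

$$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w^g_t(j)\, s_t(i - j)$$

where all indices are taken modulo N.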
If s = [0, 1, 0] (all weight on shift 0), the weighting does not move. If s = [0, 0, 1] (all weight on shift +1), the entire weighting slides one position forward — perfect for sequential iteration. If s = [1, 0, 0] (all weight on shift -1), the weighting moves backward — useful for rewinding to a previous location. And if s = [0.1, 0.8, 0.1], the weighting mostly stays but blurs slightly into neighbors, which can cause problems over many steps.
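For the three-element shift distribution described above, the circular convolution reduces to a weighted sum of rotated copies of the weighting. A sketch using `np.roll`, assuming `s` is ordered as [shift -1, shift 0, shift +1]:

```python
import numpy as np

def circular_shift(w, s):
    """Rotate weighting w by the shift distribution s = [p(-1), p(0), p(+1)].

    np.roll(w, k) moves every entry k positions forward, wrapping at the ends.
    """
    return sum(p * np.roll(w, k) for p, k in zip(s, (-1, 0, 1)))

N = 8
w = np.zeros(N)
w[2] = 1.0

forward = circular_shift(w, [0.0, 0.0, 1.0])    # focus moves 2 -> 3
backward = circular_shift(w, [1.0, 0.0, 0.0])   # focus moves 2 -> 1

last = np.zeros(N)
last[N - 1] = 1.0
wrapped = circular_shift(last, [0.0, 0.0, 1.0])  # focus wraps 7 -> 0

# Repeated slightly blurry shifts spread the focus across neighbours.
blurred = w.copy()
for _ in range(10):
    blurred = circular_shift(blurred, [0.1, 0.8, 0.1])
print(blurred.round(3))
```

After ten applications of [0.1, 0.8, 0.1] the peak weight has fallen well below one half, which is the blurring that the sharpening step is there to undo.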
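Step 3: Sharpening $\gamma_t$. The shift can cause blurring over time. To counteract this, the controller emits $\gamma_t \ge 1$, and the weighting is sharpened:

$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_{j} \tilde{w}_t(j)^{\gamma_t}}$$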
Raising to a power greater than 1 makes high values higher and low values lower (relative to each other), then renormalizing ensures the weights still sum to 1. Large γ produces sharp, focused attention; γ = 1 leaves the weighting unchanged. This is the same "temperature" trick used in softmax, but applied after the shift operation specifically to counteract the blurring it introduces. Without sharpening, repeated shifts would gradually spread the attention across many locations, making reads and writes imprecise.
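A small numerical illustration of sharpening, using a blurred three-location weighting chosen here for the example:

```python
import numpy as np

def sharpen(w, gamma):
    """Raise each weight to the power gamma, then renormalize to sum to 1."""
    powered = w ** gamma
    return powered / powered.sum()

blurred = np.array([0.1, 0.8, 0.1])

unchanged = sharpen(blurred, 1.0)  # gamma = 1 is the identity
focused = sharpen(blurred, 5.0)    # middle weight rises above 0.999
print(unchanged, focused)
```

With γ = 5 the side weights shrink from 0.1 to roughly 0.00003 each, so a single sharpening step almost fully restores a one-hot focus.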
Let's work through a concrete example. Suppose the head was at location 5 last step ($w_{t-1} = [0, \dots, 0, 1, 0, \dots, 0]$ with the 1 at position 5). The controller wants to move to location 6.
It sets g = 0 (ignore content, keep previous weighting), s = [0, 0, 1] (shift +1), and γ = 1 (no extra sharpening needed).
Interpolation: $w^g_t = 0 \cdot w^c_t + 1 \cdot w_{t-1} = w_{t-1}$ (centered on location 5).
Shift: Circular convolution with [0, 0, 1] shifts everything right by 1. Now centered on location 6.
Sharpen: γ = 1 leaves it unchanged. Result: wt is focused on location 6.
This is exactly how the NTM iterates through memory during the copy task — one step forward at each time step, no content lookup required.
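The three steps of this example can be chained in one function. This is a sketch: the memory size N = 10 and the uniform content weighting are arbitrary choices, and the content weighting is ignored anyway because g = 0.

```python
import numpy as np

def address_by_location(w_content, w_prev, g, s, gamma):
    """Interpolate, shift (s = [p(-1), p(0), p(+1)]), then sharpen."""
    w_g = g * w_content + (1.0 - g) * w_prev
    w_shifted = sum(p * np.roll(w_g, k) for p, k in zip(s, (-1, 0, 1)))
    w_sharp = w_shifted ** gamma
    return w_sharp / w_sharp.sum()

N = 10
w_prev = np.zeros(N)
w_prev[5] = 1.0                  # head was at location 5
w_content = np.full(N, 1.0 / N)  # placeholder, unused when g = 0

w = address_by_location(w_content, w_prev, g=0.0, s=[0.0, 0.0, 1.0], gamma=1.0)
print(int(np.argmax(w)))  # 6
```

Calling the function again with `w` as the new `w_prev` advances the head to location 7, and so on down the tape.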
Let's put it all together. At each time step, for each head, the controller emits five things:
| Parameter | Symbol | Size | Purpose |
|---|---|---|---|
| Key vector | $k_t$ | M | Content lookup query |
| Key strength | $\beta_t$ | 1 | Sharpness of content match |
| Interpolation gate | $g_t$ | 1 | Content vs. previous weighting |
| Shift distribution | $s_t$ | 3 | Direction and amount of shift |
| Sharpening factor | $\gamma_t$ | 1 | Counteracts shift blurring |
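The addressing pipeline flows in order: content addressing (using $k_t$ and $\beta_t$) produces $w^c_t$; interpolation (using $g_t$) blends it with $w_{t-1}$ to give $w^g_t$; the convolutional shift (using $s_t$) gives $\tilde{w}_t$; and sharpening (using $\gamma_t$) yields the final weighting $w_t$.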
This pipeline is the same for both read and write heads. The only difference is what happens after the weighting is produced: a read head uses it to retrieve a weighted sum of memory, while a write head uses it to erase and add.
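What makes this design elegant is that each stage solves a specific problem and they compose cleanly:
1. Content addressing finds locations whose contents resemble the key.
2. Interpolation lets the head keep its previous position instead of doing a fresh lookup.
3. The convolutional shift moves the head forward or backward from that position.
4. Sharpening refocuses a weighting that the shift has blurred.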
Each stage addresses a failure mode of the previous one. Content addressing alone cannot iterate. Adding interpolation and shift enables iteration but introduces blurring. Adding sharpening fixes the blurring. The final system is both expressive and stable.
One subtle but important detail: the memory is reset at the start of each input sequence. The initial memory values, the initial read vector, and the initial weighting are all learned bias vectors. The network learns what a "blank slate" should look like. This is important because the memory is working storage, not learned knowledge. Each sequence gets a fresh workspace, just like how a function call in a programming language gets a fresh stack frame.
The total number of parameters is surprisingly small. For the copy task, the NTM with a feedforward controller used only 17,162 parameters, compared to 1,352,969 for the LSTM baseline. The memory matrix (128 × 20 = 2,560 values) is not a parameter; it is temporary storage. The addressing mechanism needs only M + 6 = 26 outputs per head (write heads also emit the M-dimensional erase and add vectors). The NTM achieves its power not from having more parameters, but from having a more structured computation: separation of storage and processing, the same principle that makes computers powerful.
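To make the M + 6 count concrete, here is a sketch of how a raw controller output of that length could be split into the five addressing parameters. The squashing functions (softplus for β, sigmoid for g, softmax for s, 1 + softplus for γ) are assumptions chosen to satisfy the stated ranges; the text above does not say which functions the paper used, and `split_head_outputs` is a hypothetical helper.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def split_head_outputs(raw, M):
    """Split a length M + 6 vector into addressing parameters (assumed squashing)."""
    if raw.shape != (M + 6,):
        raise ValueError("expected a vector of length M + 6")
    return {
        "k": raw[:M],                          # key vector, M values
        "beta": softplus(raw[M]),              # key strength, > 0
        "g": sigmoid(raw[M + 1]),              # interpolation gate, in (0, 1)
        "s": softmax(raw[M + 2:M + 5]),        # shift distribution, sums to 1
        "gamma": 1.0 + softplus(raw[M + 5]),   # sharpening factor, > 1
    }

M = 20
raw = np.random.default_rng(0).normal(size=M + 6)
params = split_head_outputs(raw, M)
print(len(raw))  # 26
```

Note that the output count depends only on the slot width M, not on the number of locations N, which is why enlarging the memory adds no parameters.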
This is the NTM's signature demonstration. The network receives a sequence of random binary vectors, followed by a delimiter, and must reproduce the entire sequence from memory. Watch the memory matrix fill up as vectors are written, then drain as they are read back out.
The visualization shows four panels. Top left: the input sequence as a heatmap (each column is a time step, each row is a bit). Top right: the output sequence the NTM produces. Bottom left: the memory matrix — watch rows light up as data is written. Bottom right: the write head attention (orange) and read head attention (teal) over time.
The attention traces reveal the algorithm the NTM has learned. During the write phase, you see a warm diagonal line — the write head starts at location 0 and advances by +1 each step. During the read phase, you see a teal diagonal — the read head returns to location 0 and advances again. This is the hallmark of sequential addressing with location-based shifts: g ≈ 0 (ignore content), s ≈ [0, 0, 1] (shift +1), γ high (keep attention sharp).
Try different sequence lengths. With short sequences (4–5), the pattern is easy to see. With longer sequences (10–12), notice how the memory matrix fills more rows and the attention diagonals extend further. In the actual paper, the NTM generalized to sequences five times longer than training — the algorithm scales because it is a genuine loop, not a fixed-size pattern.
The paper tests five algorithmic tasks. Each is designed to probe a different capability of the NTM. All experiments compare three architectures: NTM with feedforward controller, NTM with LSTM controller, and a standalone LSTM network (no external memory).
Copy. Input a sequence of random 8-bit vectors (length 1–20), then output the same sequence. The NTM learned dramatically faster than LSTM and converged to near-zero error. More strikingly, an NTM trained on lengths up to 20 could copy sequences of length 100+ with no additional training. LSTM fell apart beyond length 20. The NTM had genuinely learned a copy algorithm, not just memorized training patterns.
By examining the head attention patterns, the authors reverse-engineered the learned program. In pseudocode:
    // Write phase
    move_head(start)
    while input_delimiter not seen:
        write(input_vector, head_location)
        head_location += 1

    // Read phase
    move_head(start)
    while True:
        output = read(head_location)
        emit(output)
        head_location += 1
This is precisely how a programmer would implement a copy in assembly language. The NTM discovered this algorithm from examples alone, with no human guidance about how to use the memory.
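The same procedure can be written as runnable Python by using one-hot weightings in place of the learned soft ones. This is an idealized sketch of the reverse-engineered algorithm, not the trained network: the function name is hypothetical, the head is advanced by hand, and the read loop stops after the known sequence length instead of running forever.

```python
import numpy as np

def copy_with_memory(sequence, n_locations=128):
    """Write each vector to consecutive rows, then read them back in order."""
    seq = np.asarray(sequence, dtype=float)
    length, width = seq.shape
    if length > n_locations:
        raise ValueError("sequence does not fit in memory")
    memory = np.zeros((n_locations, width))

    # Write phase: full erase plus add at the current head location.
    head = 0
    for vec in seq:
        w = np.zeros(n_locations)
        w[head] = 1.0
        memory = memory * (1.0 - np.outer(w, np.ones(width))) + np.outer(w, vec)
        head = (head + 1) % n_locations   # shift +1

    # Read phase: return to the start and step forward again.
    head = 0
    outputs = []
    for _ in range(length):
        w = np.zeros(n_locations)
        w[head] = 1.0
        outputs.append(w @ memory)
        head = (head + 1) % n_locations
    return np.array(outputs)

rng = np.random.default_rng(0)
sequence = rng.integers(0, 2, size=(10, 8))   # ten random 8-bit vectors
copied = copy_with_memory(sequence)
print(np.array_equal(copied, sequence))  # True
```

Nothing in the loop depends on the sequence length, which is the property that lets the learned version generalize beyond the lengths it was trained on.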
Repeat copy. Copy the input sequence a specified number of times. This task requires a form of nested loop: an outer loop over repetitions, and an inner loop over sequence positions. The NTM learned to write the sequence once, then re-read it the requested number of times, jumping the read head back to the beginning after each pass. It generalized to longer sequences and more repetitions than seen in training far better than LSTM, although the paper notes that beyond the training range it lost count of the repeats and emitted the end marker at the wrong time.
Associative recall. Given a list of items followed by a query item, return the item that came next in the list (in the paper, each item is a short sequence of binary vectors). You can think of consecutive items as key-value pairs. This tests content-based addressing directly: the NTM must store the bindings in memory, then use the query to locate the matching entry and read out what follows it. The NTM excelled, achieving near-zero error. LSTM struggled because it had to compress the entire association table into its fixed-size hidden state, which becomes a bottleneck as the number of items grows. The NTM's error, by contrast, grew far more slowly as the number of stored items increased beyond the training range.
Dynamic N-grams. Predict the next bit given the previous N bits, where the underlying distribution changes each sequence. This tests the NTM's ability to store and update statistical tables in memory — essentially maintaining a frequency count for each N-gram pattern. The NTM achieved performance close to the Bayesian optimal predictor, suggesting it learned to implement something akin to a counting algorithm in its memory operations.
Priority sort. Given a sequence of (value, priority) pairs, output the values sorted by priority. This is the most complex task and required multiple read/write heads (up to 8 for the feedforward controller, 5 for LSTM). The NTM learned a sorting algorithm, though generalization was less robust than for the simpler tasks. The feedforward controller needed 8 heads because, lacking internal recurrence, it could only perform unary vector operations per head per step — sorting requires comparing and swapping multiple items simultaneously.
| Task | Key capability tested | NTM advantage over LSTM |
|---|---|---|
| Copy | Sequential write + read | Much faster convergence, generalizes to about 5× longer sequences |
| Repeat copy | Loop control | Re-reads the stored sequence in a loop |
| Associative recall | Content-based lookup | Scales with memory, not hidden size |
| Dynamic N-grams | Online statistics | Near Bayesian-optimal prediction |
| Priority sort | Multi-head coordination | Faster convergence and lower error; 269K params (LSTM controller) vs 384K for LSTM |
This generalization property deserves emphasis. Standard neural networks are notoriously poor at extrapolation — a network trained on inputs in [0, 10] rarely performs well on inputs at 50. The NTM's generalization comes from its structural inductive bias: the separation of memory and control means the same small program (the controller's weights) can operate on data of any size, as long as the memory is large enough. The algorithm does not change with the input length; only the number of steps increases. This is exactly the property that makes real computer programs general.
The NTM also used far fewer parameters than LSTM on the copy task: the NTM with a feedforward controller used 17,162 parameters, while LSTM used 1,352,969, roughly 79× more. The external memory provides storage capacity without adding parameters to the network itself.
All experiments used the RMSProp optimizer with momentum 0.9, and gradient components were clipped elementwise to [-10, 10]. The memory size was 128 × 20 for all tasks. The LSTM baseline used 3 stacked hidden layers, with sizes ranging from 128 to 512 units depending on the task. The number of LSTM parameters grows quadratically with hidden size (due to recurrent connections), while NTM parameters do not grow with memory size — a fundamental structural advantage.
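Elementwise clipping, as opposed to rescaling the whole gradient by its norm, bounds each component independently. A minimal NumPy sketch with made-up gradient values (the helper name `clip_gradients` is illustrative):

```python
import numpy as np

def clip_gradients(grads, limit=10.0):
    """Clip every component of every gradient array to [-limit, limit]."""
    return [np.clip(g, -limit, limit) for g in grads]

# Hypothetical gradients for two parameter tensors.
grads = [
    np.array([0.5, -25.0, 12.0]),
    np.array([[3.0, -10.5]]),
]
clipped = clip_gradients(grads)
print(clipped[0])  # components are now 0.5, -10.0, 10.0
print(clipped[1])  # components are now 3.0, -10.0
```

Components already inside the range pass through untouched, so the clip only intervenes on the occasional exploding value.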
One fascinating detail: during the copy task, the NTM's memory usage was bounded by the sequence length, not the total memory capacity. The cyclical shift mechanism meant that for sequences longer than 128 (the memory size), the head would wrap around and overwrite earlier writes. The network's generalization limit was literally the size of its RAM.
The training setup itself was notable for its simplicity. All tasks used binary cross-entropy loss with logistic sigmoid outputs. The networks were trained from scratch on each task, with no pre-training or curriculum learning. The NTM's architectural inductive bias — the separation of memory and computation, the addressing mechanisms — was sufficient for the network to discover the right algorithms. No human engineering of the memory access patterns was required; the gradients found the solution.
The comparison with LSTM is particularly instructive on the priority sort task. The LSTM baseline used 3 hidden layers of 128 units each, totaling 384,424 parameters. The NTM with feedforward controller used 508,305 parameters (due to 8 heads), but the NTM with LSTM controller used only 269,038. In all cases, the NTM converged faster and to lower error. The external memory let a smaller controller solve a harder problem.
The Neural Turing Machine sits at the intersection of two deep ideas: augmenting neural networks with structured memory, and making discrete computation differentiable. Its influence rippled through the field in both directions.
Memory Networks (Weston et al., 2014). Published the same year by Jason Weston and colleagues at Facebook AI Research, Memory Networks also coupled a neural network to external memory. The key difference: Memory Networks used hard attention (selecting a single memory slot), so the lookup was not differentiable and required supervision for the memory access patterns. You had to tell the model which memory slot to read, rather than letting it learn. The NTM's soft attention trained end-to-end with no memory supervision. Sukhbaatar et al. (2015) later bridged this gap with End-to-End Memory Networks, which adopted soft attention like the NTM, a sign that differentiable addressing was a productive direction.
Differentiable Neural Computer (Graves et al., 2016). The direct successor. The DNC replaced the NTM's convolutional shift with more sophisticated addressing: temporal link matrices to track write order, a usage vector to find free memory locations, and allocation-based writing. The DNC generalized better on graph traversal and puzzle tasks but was significantly more complex. The NTM's simplicity remains instructive.
Attention mechanisms and Transformers. The NTM's content-based addressing (score each memory row against a query, pass the scores through a softmax, then take the weighted sum) has the same structure as the attention mechanism in Bahdanau et al. (2014) and the Transformer (Vaswani et al., 2017). Only the scoring function differs: the NTM uses cosine similarity scaled by β, Bahdanau attention uses a small feedforward network, and the Transformer uses scaled dot products. In the Transformer, queries attend to keys and retrieve values from the same sequence. In the NTM, the key attends to memory rows and retrieves their contents. The mechanism is the same; the memory source is different.
In fact, you can view a Transformer layer as an NTM where the "memory" is the set of key-value vectors from other positions in the sequence, and the "controller" is the query computation. The Transformer removed the sequential controller (replacing it with parallel self-attention) and eliminated location-based addressing (relying on positional encodings instead). These simplifications made the architecture parallelizable and scalable, but the core memory-access principle originated here.
Modern retrieval-augmented generation (RAG). Today's large language models use retrieval-augmented generation: look up relevant documents from an external database, then condition generation on the retrieved text. This is the NTM's philosophy at scale — separate what the model knows how to do (encoded in weights) from what it knows about (stored externally). The model does not try to memorize every fact in its parameters; it stores knowledge externally and retrieves it when needed. The NTM was arguably the first differentiable implementation of this principle, though at a much smaller scale than modern RAG systems.
Working memory in neuroscience. The paper explicitly draws parallels to working memory in cognitive science — the prefrontal cortex as controller, short-term memory buffers as the memory matrix. The NTM's learned read/write policies mirror how psychologists describe working memory: a "central executive" that directs attention to manipulate stored information. The paper cites psychologist George Miller's famous "magical number seven" — human working memory holds about 7 ± 2 chunks. The NTM is not constrained by this biological limit; its memory can be arbitrarily large.
Neural program induction vs. synthesis. The NTM occupies an interesting middle ground. It does not synthesize a program as explicit code; instead, it induces program-like behavior through its learned controller weights. The controller is a fixed neural network that, when run step-by-step with access to external memory, executes a procedure. Later work like Neural Programmer-Interpreters (Reed & de Freitas, 2015) and Neuro-Symbolic Program Synthesis (Parisotto et al., 2016) pushed further toward explicit program generation, but the NTM showed that implicit program learning via memory manipulation is surprisingly powerful.
Limitations. The NTM has real shortcomings. Training is unstable: the attention mechanisms can saturate or oscillate, and the authors used gradient clipping (components clipped to [-10, 10]) to keep training stable. The convolutional shift is limited to small offsets ({-1, 0, +1}), making random access to distant locations slow. And the memory has a fixed size N, set before training; the network cannot dynamically allocate new memory. The DNC tackled the addressing and allocation issues, but at the cost of significantly more complexity.
Fodor and Pylyshyn's challenge. The paper's background discussion includes an underappreciated piece of context: the 1988 critique by Fodor and Pylyshyn, who argued that neural networks cannot do variable binding, that is, assigning arbitrary data to arbitrary slots. In language, when you hear "Mary spoke to John," you bind "Mary" to the subject role and "John" to the object role. Fodor and Pylyshyn claimed connectionist systems could not do this. The NTM offers a direct answer: its write operation binds data (the add vector) to a slot (a memory location), and the experiments show that binding being learned from examples.
Paper details. "Neural Turing Machines," Alex Graves, Greg Wayne, Ivo Danihelka. Google DeepMind, 2014. arXiv:1410.5401. One of Ilya Sutskever's "30 under 30" papers for understanding modern deep learning.