Neural Turing Machines

Chapter 0: The Limitation

Imagine you need to copy a sequence of numbers. Someone reads you ten digits: 3, 7, 1, 4, 9, 2, 6, 8, 0, 5. Now repeat them back.

Easy, right? You store them in short-term memory, then read them out. You do not try to compress the digits into some internal feeling about what they mean. You simply remember the raw data and recall it. Your brain has a dedicated workspace for this — psychologists call it working memory. It holds a small amount of information (famously, "seven plus or minus two" items) and manipulates it according to rules.

Now try doing this with an LSTM.

An LSTM has a fixed-size hidden state — say, 256 numbers. It must squeeze everything it has ever seen into that vector. For short sequences, it works. But ask it to copy 100 random vectors, and it crumbles. The hidden state cannot hold that much raw data. The information gets compressed, mixed, and eventually lost.

This is less a failure of optimization than an architectural constraint. You are asking the network to hold 100 × 8 = 800 bits of arbitrary data in a single 256-dimensional vector that is rewritten at every time step. In principle, 256 real numbers could encode that many bits, but in practice the noisy, finite-precision state learned by gradient descent cannot represent all of that data faithfully, and the problem only worsens as sequences grow.

The core problem: RNNs and LSTMs store everything in their internal hidden state. This state has a fixed capacity. When a task requires storing and retrieving arbitrary data — like a copy operation or a sorting algorithm — the hidden state becomes a bottleneck. What if we gave the network an external scratchpad it could write to and read from?

This is not just a limitation of LSTMs. It is a limitation of all standard neural network architectures. Feedforward networks have no memory at all — they process each input independently. RNNs have memory (the hidden state), but it is entangled with computation. The hidden state must simultaneously serve as both "what I remember" and "what I am currently computing." As tasks become more complex, these two roles conflict.

Computers solved this problem decades ago. A CPU has a tiny set of registers (like an LSTM's hidden state), but it also has RAM — a large, addressable memory bank. The CPU writes data to specific addresses in RAM and later reads it back. The registers handle computation; the RAM handles storage. Separating storage from processing is one of the most powerful principles in computer architecture — it is what makes computers able to handle arbitrary amounts of data with a fixed-size processor.

Graves, Wayne, and Danihelka asked: can we give a neural network its own RAM? And more importantly, can we make the whole thing differentiable, so the network learns how to use its memory through gradient descent?

The answer was the Neural Turing Machine (NTM): a neural network coupled to an external memory matrix, interacting with it through learned attention mechanisms. The name is a deliberate homage to Alan Turing's theoretical machine — a finite controller with access to an infinite tape. The NTM is the same idea, made differentiable.

The challenge was not conceptual — the idea of giving neural networks external memory is obvious. The challenge was engineering: how do you make memory access differentiable? A conventional memory lookup says "give me the data at address 7." That is a discrete operation with no gradient. You cannot backpropagate through "pick address 7" because there is no smooth function mapping a continuous input to a discrete address choice.

The NTM's breakthrough was replacing hard address selection with soft attention: instead of picking one address, attend to all addresses simultaneously with different weights. The gradient flows through the weights, and the network learns which addresses to attend to. This idea — making a discrete selection continuous by replacing it with a weighted average — would become one of the most influential techniques in deep learning, underpinning attention mechanisms, mixture-of-experts routing, and differentiable programming more broadly.

Why can't an LSTM reliably copy a long sequence of arbitrary vectors?

Its fixed-size hidden state cannot store arbitrary amounts of raw data; information gets compressed and lost as the sequence grows
LSTMs cannot process sequences longer than their training length
The vanishing gradient problem prevents LSTMs from learning copy tasks

Chapter 1: Architecture Overview

The NTM has two components: a controller and a memory matrix.

The controller is a neural network — either a feedforward network or an LSTM. It receives input from the outside world, produces output, and most importantly, it interacts with the memory matrix through read heads and write heads.

The memory is an N × M matrix, written M_t at time step t. (The paper uses the letter M for both the matrix and its width; context tells them apart.) Think of it as N rows, each storing an M-dimensional vector. N is the number of memory locations (addresses), and M is the width of each memory slot. In the paper's experiments, N = 128 locations, each holding a 20-dimensional vector. That is 2,560 floating-point numbers of working storage: not a lot by modern standards, but enough to store and manipulate sequences of over 100 vectors.

Input

External data enters the controller

↓

Controller (LSTM or Feedforward)

Processes input, emits output, and produces head parameters

↓ read/write heads

Memory Matrix (N × M)

External storage: 128 locations × 20-dimensional vectors

The key distinction from a standard RNN: the controller and memory are separate. The controller's weights are fixed after training (they encode the "program"). The memory contents change at every time step (they hold the "data"). This separation is exactly the stored-program architecture that von Neumann described in 1945, now implemented with differentiable components.

At each time step, the controller does three things:

1. Read. Each read head produces a weighting over memory locations — a vector of N non-negative values that sum to 1. This weighting acts like a soft attention distribution. The read head returns a weighted sum of all memory rows. If the weighting is sharp (concentrated on one location), you get that location's content. If it is diffuse, you get a blend.

2. Write. Each write head also produces a weighting, plus two vectors: an erase vector (which elements to clear) and an add vector (what new data to store). This two-step erase-then-add process is borrowed directly from LSTM's forget-then-input gating mechanism, generalized to an external memory bank.

3. Output. The controller produces an output for the external world, informed by what it just read from memory. The read vector from the previous step is typically concatenated with the current input and fed into the controller, so the controller always knows what it last retrieved from memory.

The key insight: Everything is soft. Instead of reading from one address (like a computer), the NTM reads from all addresses simultaneously, weighted by attention. This makes the entire system differentiable — gradients flow through the read and write operations, so the network learns where to read and what to write through backpropagation.

The controller can be either feedforward or recurrent (LSTM). A feedforward controller has no internal memory — it relies entirely on the external memory matrix. An LSTM controller has its own hidden state in addition to the memory matrix, like having both CPU registers and RAM. The paper found that both work, but LSTM controllers are more robust because they can buffer information internally across time steps.

There is an important subtlety about feedforward controllers. With a single read head, a feedforward controller can only perform a unary operation on one memory vector per step. If you need to compare two memory locations (say, for sorting), you need two read heads. An LSTM controller can work around this by caching previous reads in its hidden state, but a feedforward controller cannot. This is why the priority sort task required 8 heads with a feedforward controller but only 5 with LSTM.

Why must the NTM's memory access be "soft" (weighted across all locations) rather than "hard" (selecting a single address)?

Soft access is differentiable, allowing gradients to flow through memory operations so the network can learn where to read and write via backpropagation
Hard access would be too slow for real-time processing
Soft access uses less memory than hard access

Chapter 2: Reading from Memory

Reading is beautifully simple. You have a memory matrix M_t with N rows, and a weighting vector w_t with N elements that sum to 1. The read operation returns a weighted combination of all memory rows:

r_t = ∑_i w_t(i) · M_t(i)

Where M_t(i) is the i-th row of memory (an M-dimensional vector), and w_t(i) is the attention weight on location i. The constraints on w are simple: every element must be between 0 and 1, and they must sum to 1. In other words, w is a probability distribution over memory locations. This makes the read operation a convex combination — the result always lies "within" the cloud of memory vectors, never outside it.

Let's work through a concrete example. Suppose we have 4 memory locations, each storing a 3-dimensional vector:

Location	Content	Weight w(i)
0	[1.0, 0.0, 0.5]	0.1
1	[0.0, 1.0, 0.0]	0.7
2	[0.5, 0.5, 1.0]	0.1
3	[0.0, 0.0, 0.0]	0.1

The read vector is:

r = 0.1×[1,0,0.5] + 0.7×[0,1,0] + 0.1×[0.5,0.5,1] + 0.1×[0,0,0]
= [0.1, 0, 0.05] + [0, 0.7, 0] + [0.05, 0.05, 0.1] + [0, 0, 0]
= [0.15, 0.75, 0.15]

Because w(1) = 0.7 dominates, the result is close to location 1's content [0, 1, 0]. The read head is mostly "looking at" location 1, with a bit of blurring from other locations.
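The arithmetic above can be checked in a few lines of Python. This is a minimal sketch rather than code from the paper: the function name `read` and the list-of-lists memory layout are assumptions made for illustration.

```python
def read(memory, weights):
    """Return the weighted sum of memory rows: r = sum_i w(i) * M(i)."""
    width = len(memory[0])
    return [
        sum(w * row[j] for w, row in zip(weights, memory))
        for j in range(width)
    ]

# The 4 x 3 memory and the weighting from the table above.
memory = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 1.0],
    [0.0, 0.0, 0.0],
]
weights = [0.1, 0.7, 0.1, 0.1]

r = read(memory, weights)
print([round(x, 2) for x in r])  # [0.15, 0.75, 0.15]
```

Every step is a multiplication or an addition, which is why gradients can flow back to both the weights and the memory contents.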

Notice that this is exactly the same operation as attention in sequence-to-sequence models. In Bahdanau attention (2014), the decoder computes a weighted sum of encoder hidden states. In the NTM, the controller computes a weighted sum of memory rows. The mathematical form is identical — the NTM just applies it to an external memory bank instead of another neural network's hidden states. This is not a coincidence: the NTM and Bahdanau attention were developed around the same time, and both represent the same fundamental insight about differentiable information retrieval.

Think of it this way. A hard read picks one row: r = M(3). A soft read blends all rows according to attention. When attention is very sharp (one weight near 1, the rest near 0), soft reading approximates hard reading. But it remains differentiable, so gradient descent can nudge the attention weights toward better memory locations.

If the weighting vector is [0, 0, 0, 1], what does the read operation return?

Exactly the content of memory location 3: a perfectly sharp weighting is equivalent to a hard read from that single address
A blend of all four locations equally
The zero vector, since the first three weights are zero

Chapter 3: Writing to Memory

Writing is more complex than reading. Inspired by LSTM's input and forget gates, the paper decomposes each write into two steps: erase, then add.

Step 1: Erase. The write head produces an erase vector e_t with M elements, each in the range (0, 1). The memory is erased pointwise:

M̃_t(i) = M_{t-1}(i) · [1 − w_t(i) · e_t]

If both the weight w_t(i) and the erase element e_t(j) are 1, that element of that memory location is set to zero. If either is 0, the memory is unchanged. This gives fine-grained control: you can erase specific elements of specific locations.

Step 2: Add. The write head produces an add vector a_t with M elements. After erasing, the new data is written:

M_t(i) = M̃_t(i) + w_t(i) · a_t

The weight w_t(i) controls where the data goes. The add vector a_t controls what data is written. Locations with high weight receive the full add vector; locations with low weight receive almost nothing.

Why erase-then-add? This two-step process is analogous to how you would update a variable in a computer: first clear the old value, then write the new one. If you only added without erasing, the memory would accumulate endlessly — old data would pile up underneath new data. If you only erased, you could never write new information. The combination allows clean overwrites: erase the parts you want to change, then add the new values. And crucially, both steps are differentiable — element-wise multiplication and addition have well-defined gradients, so the network can learn what to erase and what to add through standard backpropagation.

There is an elegant symmetry with LSTM gates. The LSTM's forget gate decides what to erase from the cell state (multiplication by a value in [0, 1]). The input gate decides what to add. The NTM's erase and add vectors play exactly the same roles, but applied to an external memory matrix rather than an internal cell state. The NTM generalizes the LSTM's gated memory to an addressable, multi-location memory bank.

Let's trace a concrete write. Suppose location 2 currently holds [0.5, 0.5, 1.0], the write weight on location 2 is 0.9, the erase vector is [1, 1, 0], and the add vector is [0.3, 0.8, 0.0].

Erase: M̃(2) = [0.5, 0.5, 1.0] × [1 − 0.9 × [1, 1, 0]] = [0.5, 0.5, 1.0] × [0.1, 0.1, 1.0] = [0.05, 0.05, 1.0]

Add: M(2) = [0.05, 0.05, 1.0] + 0.9 × [0.3, 0.8, 0.0] = [0.05, 0.05, 1.0] + [0.27, 0.72, 0.0] = [0.32, 0.77, 1.0]

The first two elements were almost entirely erased (scaled down to 10% of their old values) and then overwritten with new data; the third element (erase = 0) was left untouched. This is how a write head can selectively update only the parts of a memory location it wants to change.
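The trace above can be reproduced with a short sketch of the erase-then-add update. The function name `write` is hypothetical. The text only fixes the weight on location 2, so the other rows and the placement of the remaining 0.1 of write weight on location 3 are assumptions made to complete the example.

```python
def write(memory, weights, erase, add):
    """Erase-then-add write; returns a new memory as a list of rows."""
    new_memory = []
    for w, row in zip(weights, memory):
        # Erase: scale each element by 1 - w * e (elementwise).
        erased = [m * (1.0 - w * e) for m, e in zip(row, erase)]
        # Add: deposit w * a on top of what survived the erase.
        new_memory.append([m + w * a for m, a in zip(erased, add)])
    return new_memory

memory = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 1.0],
    [0.0, 0.0, 0.0],
]
weights = [0.0, 0.0, 0.9, 0.1]  # assumed: leftover weight sits on location 3
erase = [1.0, 1.0, 0.0]
add = [0.3, 0.8, 0.0]

updated = write(memory, weights, erase, add)
print([round(x, 2) for x in updated[2]])  # [0.32, 0.77, 1.0]
```

Rows with zero write weight come back unchanged, which is what lets a focused head edit one location without disturbing the rest.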

An important property: when multiple write heads are present, the erasures can be performed in any order because multiplication is commutative. Similarly, the adds are order-independent. This means the system is well-defined regardless of how we schedule multiple heads — a valuable property for implementation and for theoretical analysis of the system's behavior.
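A quick numerical check of this order independence, as a sketch under assumed values: two hypothetical write heads act on the same row, all erases are applied before all adds, and the heads are processed in both orders. The helper names and head parameters are made up for illustration.

```python
import math

def apply_erases(row, erases):
    """Apply each head's erase, given as (weight, erase_vector), to one row."""
    for w, e in erases:
        row = [m * (1.0 - w * ej) for m, ej in zip(row, e)]
    return row

def apply_adds(row, adds):
    """Apply each head's add, given as (weight, add_vector), to one row."""
    for w, a in adds:
        row = [m + w * aj for m, aj in zip(row, a)]
    return row

row = [0.5, 0.5, 1.0]
# Hypothetical parameters for two write heads focused on the same location.
erases = [(0.9, [1.0, 1.0, 0.0]), (0.4, [0.0, 1.0, 1.0])]
adds = [(0.9, [0.3, 0.8, 0.0]), (0.4, [0.1, 0.0, 0.6])]

forward = apply_adds(apply_erases(row, erases), adds)
backward = apply_adds(apply_erases(row, erases[::-1]), adds[::-1])

# Compare with a tolerance, since floating-point rounding can differ by order.
same = all(math.isclose(x, y) for x, y in zip(forward, backward))
print(same)  # True
```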

If the erase vector is [0, 0, 0] (all zeros), what happens to memory during the write operation?

Nothing is erased: the add vector is simply added on top of the existing memory content, accumulating data
The entire memory location is cleared to zero
The write operation is skipped entirely

Chapter 4: Content-Based Addressing

We know that read and write operations use a weighting vector over memory locations. But how is that weighting produced? The NTM uses two complementary addressing mechanisms. The first is content-based addressing.

The idea: find memory locations whose contents are similar to a query. This is exactly how you recall a phone number — you think of the person's name, and your brain retrieves the associated number. You address by content, not by position. This is also how Hopfield networks (1982) work: store patterns in a weight matrix, then retrieve them by providing a partial or noisy version of the pattern. The NTM makes the same principle differentiable and integrates it into an end-to-end trainable system.

Each head produces a key vector k_t (length M) and a key strength β_t (a positive scalar). The content-based weighting is computed by comparing the key to every row in memory using cosine similarity, then applying a softmax sharpened by β:

w_t^c(i) = exp(β_t · K(k_t, M_t(i))) / ∑_j exp(β_t · K(k_t, M_t(j)))

Where K is cosine similarity:

K(u, v) = (u · v) / (||u|| · ||v||)

This is just a softmax over similarity scores, weighted by β. Let's unpack what β does.

When β is small (close to 0), the softmax output is nearly uniform — the head attends to all locations roughly equally, regardless of similarity. When β is large (say, 100), the softmax becomes very sharp, concentrating almost all weight on the single most similar location. The controller learns to adjust β to control how precise its memory lookups should be.

Think of it as a library search. The key vector k is your search query. Each memory location is a book. Cosine similarity measures how relevant each book is to your query. The key strength β controls how picky you are: low β means "give me a little of everything," high β means "give me only the best match."
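To see the effect of β concretely, here is a minimal sketch of content-based addressing. The function names, the example memory, and the small epsilon in the denominator are assumptions for illustration; the epsilon is an implementation convenience to avoid dividing by zero on an all-zero row, not something taken from the paper.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon guards against 0/0

def content_weighting(memory, key, beta):
    """Softmax over beta-scaled cosine similarities between key and each row."""
    scores = [beta * cosine_similarity(key, row) for row in memory]
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

memory = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 1.0],
    [0.0, 0.0, 0.0],
]
key = [0.0, 1.0, 0.0]  # points in the same direction as location 1

for beta in (0.01, 1.0, 100.0):
    w = content_weighting(memory, key, beta)
    print(beta, [round(x, 3) for x in w])
```

With β = 0.01 the printed weights are close to uniform, and with β = 100 nearly all of the weight lands on location 1.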

Content-based addressing is powerful for associative recall: you store a key-value pair, then later retrieve the value by providing a similar key. The paper tested this with an associative recall task, and the NTM learned it to near-zero error. This is directly analogous to a hash table lookup, except it is fuzzy: the query does not need to exactly match the key, just be similar enough for cosine similarity to identify the right location.

You might wonder: why cosine similarity rather than, say, Euclidean distance? Cosine similarity measures the angle between vectors, ignoring their magnitude. This makes it invariant to scaling — a memory vector that is "loud" (large magnitude) and one that is "quiet" (small magnitude) can still match perfectly if they point in the same direction. This is important because the controller and memory may operate at different scales.
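The scale invariance is easy to verify numerically. In this sketch the vectors are made-up examples: `quiet` and `loud` point in the same direction as `key` but differ in magnitude.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

key = [0.2, 0.4, 0.4]
quiet = [0.1, 0.2, 0.2]  # same direction as key, half the magnitude
loud = [2.0, 4.0, 4.0]   # same direction as key, ten times the magnitude

# Cosine similarity treats both as perfect matches.
print(round(cosine_similarity(key, quiet), 6), round(cosine_similarity(key, loud), 6))  # 1.0 1.0
# Euclidean distance ranks them very differently.
print(round(euclidean_distance(key, quiet), 3), round(euclidean_distance(key, loud), 3))  # 0.3 5.4
```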

But content-based addressing has a limitation. What if the content of a memory location is arbitrary? In a copy task, the data is random binary vectors — there is no meaningful content to search for. You just need to write to "the next empty slot" and later read from "slot 1, then slot 2, then slot 3." You need location-based addressing.

What does the key strength β control in content-based addressing?

How sharp or diffuse the attention distribution is: large β concentrates weight on the best match, small β spreads weight evenly
The dimensionality of the key vector
How many memory locations the head can access simultaneously

Chapter 5: Location-Based Addressing

Location-based addressing lets the head move through memory like a tape head on a Turing machine: step forward, step backward, or stay in place. It uses three mechanisms: interpolation, shift, and sharpening.

Step 1: Interpolation gate g_t. The controller emits a scalar g_t ∈ (0, 1) that blends between the content-based weighting and the previous time step's weighting:

w_t^g = g_t · w_t^c + (1 − g_t) · w_{t-1}

If g = 1, the head uses pure content addressing (fresh lookup). If g = 0, the head ignores content entirely and reuses last step's weighting. This is critical for iteration: once you have found the starting location, set g = 0 and just shift from where you were.
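The gate is a plain convex blend of two weightings, as the following sketch shows. The function name `interpolate` and the example weightings are assumptions for illustration.

```python
def interpolate(content_w, previous_w, g):
    """w^g = g * w^c + (1 - g) * w_{t-1}, applied elementwise."""
    return [g * c + (1.0 - g) * p for c, p in zip(content_w, previous_w)]

content_w = [0.0, 0.0, 0.0, 1.0]   # a fresh content lookup points at location 3
previous_w = [0.0, 1.0, 0.0, 0.0]  # last step's head position was location 1

print(interpolate(content_w, previous_w, 1.0))  # [0.0, 0.0, 0.0, 1.0]
print(interpolate(content_w, previous_w, 0.0))  # [0.0, 1.0, 0.0, 0.0]
print(interpolate(content_w, previous_w, 0.5))  # [0.0, 0.5, 0.0, 0.5]
```

Intermediate values of g split the head's attention between the lookup result and the old position, so the result still sums to 1.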

Step 2: Convolutional shift s_t. The controller emits a shift distribution s_t over allowed shifts (typically {-1, 0, +1}). The gated weighting is rotated via circular convolution:

w̃_t(i) = ∑_j w_t^g(j) · s_t(i − j mod N)

If s = [0, 1, 0] (all weight on shift 0), the weighting does not move. If s = [0, 0, 1] (all weight on shift +1), the entire weighting slides one position forward — perfect for sequential iteration. If s = [1, 0, 0] (all weight on shift -1), the weighting moves backward — useful for rewinding to a previous location. And if s = [0.1, 0.8, 0.1], the weighting mostly stays but blurs slightly into neighbors, which can cause problems over many steps.
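The shift cases above can be sketched as a small circular convolution. The function name and the `offsets` parameter are hypothetical; the shift distribution is listed in the order (-1, 0, +1), matching the vectors in the text.

```python
def circular_shift(weights, shift_dist, offsets=(-1, 0, 1)):
    """Spread each weight onto its shifted positions, wrapping modulo N."""
    n = len(weights)
    shifted = [0.0] * n
    for j, w in enumerate(weights):
        for offset, s in zip(offsets, shift_dist):
            shifted[(j + offset) % n] += w * s
    return shifted

w = [0.0, 0.0, 1.0, 0.0, 0.0]  # head focused on location 2
print(circular_shift(w, [0.0, 0.0, 1.0]))  # [0.0, 0.0, 0.0, 1.0, 0.0]
print(circular_shift(w, [1.0, 0.0, 0.0]))  # [0.0, 1.0, 0.0, 0.0, 0.0]
print(circular_shift(w, [0.1, 0.8, 0.1]))  # [0.0, 0.1, 0.8, 0.1, 0.0]

# Wrap-around: shifting forward from the last location lands on location 0.
print(circular_shift([0.0, 0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0]))  # [1.0, 0.0, 0.0, 0.0, 0.0]
```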

Step 3: Sharpening γ_t. The shift can cause blurring over time. To counteract this, the controller emits γ_t ≥ 1, and the weighting is sharpened:

w_t(i) = w̃_t(i)^γ_t / ∑_j w̃_t(j)^γ_t

Raising to a power greater than 1 makes high values higher and low values lower (relative to each other), then renormalizing ensures the weights still sum to 1. Large γ produces sharp, focused attention; γ = 1 leaves the weighting unchanged. This is the same "temperature" trick used in softmax, but applied after the shift operation specifically to counteract the blurring it introduces. Without sharpening, repeated shifts would gradually spread the attention across many locations, making reads and writes imprecise.
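A small sketch of the sharpening step, applied to a weighting that has been blurred into its neighbors. The function name `sharpen` and the example values are assumptions for illustration.

```python
def sharpen(weights, gamma):
    """w(i) = w~(i)^gamma / sum_j w~(j)^gamma, with gamma >= 1."""
    powered = [w ** gamma for w in weights]
    total = sum(powered)
    return [p / total for p in powered]

blurry = [0.0, 0.1, 0.8, 0.1, 0.0]
print([round(x, 3) for x in sharpen(blurry, 1.0)])  # [0.0, 0.1, 0.8, 0.1, 0.0]
print([round(x, 3) for x in sharpen(blurry, 3.0)])  # [0.0, 0.002, 0.996, 0.002, 0.0]
```

With γ = 3 the two neighbors drop from 10% each to about 0.2% each, which is enough to undo one step of mild blurring.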

Let's work through a concrete example. Suppose the head was at location 5 last step (w_{t-1} = [0,...,0,1,0,...,0] with the 1 at position 5). The controller wants to move to location 6.

It sets g = 0 (ignore content, keep previous weighting), s = [0, 0, 1] (shift +1), and γ = 1 (no extra sharpening needed).

Interpolation: w^g = 0 · w^c + 1 · w_{t-1} = w_{t-1} (centered on location 5).

Shift: Circular convolution with [0, 0, 1] shifts everything right by 1. Now centered on location 6.

Sharpen: γ = 1 leaves it unchanged. Result: w_t is focused on location 6.

This is exactly how the NTM iterates through memory during the copy task — one step forward at each time step, no content lookup required.

The three modes. The combined system can operate in three ways. Mode 1: Pure content addressing (g = 1, no shift). Mode 2: Content addressing followed by a shift — find a block of data by content, then offset to a nearby element. Mode 3: Pure location addressing (g = 0, shift from previous position) — iterate sequentially through memory without any content lookup.

During a copy task, the NTM writes input vectors to consecutive memory locations. Which addressing mode does this most likely use?

Mode 3: set g = 0 to ignore content, shift +1 each step to advance through memory sequentially
Mode 1: use content addressing to find empty locations
Mode 2: search by content, then shift to the nearest empty slot

Chapter 6: The Full Addressing Pipeline

Let's put it all together. At each time step, for each head, the controller emits five things:

Parameter	Symbol	Size	Purpose
Key vector	k_t	M	Content lookup query
Key strength	β_t	1	Sharpness of content match
Interpolation gate	g_t	1	Content vs. previous weighting
Shift distribution	s_t	3	Direction and amount of shift
Sharpening factor	γ_t	1	Counteracts shift blurring

The addressing pipeline flows in order:

Content Addressing

Compare k_t to all memory rows using cosine similarity, scale by β_t, softmax → w^c

↓

Interpolation

Blend w^c with previous weighting w_{t-1} using gate g_t → w^g

↓

Convolutional Shift

Circularly shift w^g by s_t → w̃

↓

Sharpening

Raise w̃ to power γ_t and renormalize → w_t

This pipeline is the same for both read and write heads. The only difference is what happens after the weighting is produced: a read head uses it to retrieve a weighted sum of memory, while a write head uses it to erase and add.

What makes this design elegant is that each stage solves a specific problem and they compose cleanly:

Content addressing solves "where is similar data?"
Interpolation solves "should I look up fresh or continue from last step?"
Shift solves "should I move to a nearby location?"
Sharpening solves "has the attention become too blurry?"

Each stage addresses a failure mode of the previous one. Content addressing alone cannot iterate. Adding interpolation and shift enables iteration but introduces blurring. Adding sharpening fixes the blurring. The final system is both expressive and stable.
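The four stages compose into a single function. The sketch below is one possible arrangement, not the paper's implementation: the names `address` and `content_weighting`, the toy 8 × 2 memory, and the epsilon in the cosine denominator are all assumptions. It runs a pure location step (g = 0, shift +1) and a pure content lookup (g = 1, no shift).

```python
import math

def content_weighting(memory, key, beta):
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / (norms + 1e-8)  # epsilon is an implementation convenience
    scores = [beta * cos(key, row) for row in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def address(memory, prev_w, key, beta, g, shift, gamma):
    n = len(memory)
    w_c = content_weighting(memory, key, beta)                  # 1. content
    w_g = [g * c + (1.0 - g) * p for c, p in zip(w_c, prev_w)]  # 2. interpolation
    w_s = [0.0] * n                                             # 3. circular shift
    for j, w in enumerate(w_g):
        for offset, s in zip((-1, 0, 1), shift):
            w_s[(j + offset) % n] += w * s
    powered = [w ** gamma for w in w_s]                         # 4. sharpening
    total = sum(powered)
    return [p / total for p in powered]

# Toy memory: row i holds [i, 1], so every row points in a different direction.
memory = [[float(i), 1.0] for i in range(8)]
prev_w = [0.0] * 8
prev_w[5] = 1.0  # head was on location 5 last step

# Pure location addressing: ignore content, step forward by one.
step = address(memory, prev_w, key=[1.0, 0.0], beta=1.0, g=0.0,
               shift=[0.0, 0.0, 1.0], gamma=1.0)
print(step.index(max(step)))  # 6

# Pure content addressing: look up the row that matches the key's direction.
lookup = address(memory, prev_w, key=[3.0, 1.0], beta=1000.0, g=1.0,
                 shift=[0.0, 1.0, 0.0], gamma=1.0)
print(lookup.index(max(lookup)))  # 3
```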

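Parameter count. For each head, the controller must output M + 1 + 1 + 3 + 1 = M + 6 addressing values. With M = 20 (as in the paper's experiments), that is 26 extra outputs per head, so a single read head and a single write head add 52 addressing outputs to the controller's output layer. A write head additionally emits its erase and add vectors (2M = 40 more values). These are outputs, not parameters: each one is fed by its own set of output-layer weights, so the parameter cost scales with the controller's size. The memory matrix itself (128 × 20 = 2,560 values) is not a learned weight; it is working storage, reset for each sequence.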
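The per-head tally can be written out explicitly. This sketch uses a hypothetical helper, `head_output_sizes`. It counts the M + 6 addressing values and, for a write head, also the erase and add vectors described in the chapter on writing (M elements each).

```python
def head_output_sizes(width, num_shifts=3):
    """Count the values the controller emits for one head."""
    addressing = width + 1 + 1 + num_shifts + 1  # k, beta, g, s, gamma
    write_extras = 2 * width                     # erase vector + add vector
    return {"read": addressing, "write": addressing + write_extras}

sizes = head_output_sizes(width=20)
print(sizes)  # {'read': 26, 'write': 66}
```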

One subtle but important detail: the memory is reset at the start of each input sequence. The initial memory values, the initial read vector, and the initial weighting are all learned bias vectors. The network learns what a "blank slate" should look like. This is important because the memory is working storage, not learned knowledge. Each sequence gets a fresh workspace, just like how a function call in a programming language gets a fresh stack frame.

The total number of parameters is surprisingly small. For the copy task, the NTM with a feedforward controller used only 17,162 parameters, compared to 1,352,969 for the LSTM baseline. The NTM achieves its power not from having more parameters, but from having a more structured computation: separation of storage and processing, the same principle that makes computers powerful.

In the addressing pipeline, what is the purpose of the interpolation gate g_t?

It blends between the current content-based weighting and the previous step's weighting, allowing the head to either do a fresh lookup or continue from where it left off
It controls how much data is erased from memory before writing
It determines the number of memory locations the head can access

Chapter 7: Showcase — NTM Copy Task

This is the NTM's signature demonstration. The network receives a sequence of random binary vectors, followed by a delimiter, and must reproduce the entire sequence from memory. Watch the memory matrix fill up as vectors are written, then drain as they are read back out.

What to watch for. During the write phase, the write head advances through memory locations 0, 1, 2, ..., writing each input vector to the next row. During the read phase, the read head returns to location 0 and advances through the same locations, reading each vector back out. The attention patterns should be sharp diagonals — each step attends to exactly one location.


The visualization shows four panels. Top left: the input sequence as a heatmap (each column is a time step, each row is a bit). Top right: the output sequence the NTM produces. Bottom left: the memory matrix — watch rows light up as data is written. Bottom right: the write head attention (orange) and read head attention (teal) over time.

The attention traces reveal the algorithm the NTM has learned. During the write phase, you see a warm diagonal line — the write head starts at location 0 and advances by +1 each step. During the read phase, you see a teal diagonal — the read head returns to location 0 and advances again. This is the hallmark of sequential addressing with location-based shifts: g ≈ 0 (ignore content), s ≈ [0, 0, 1] (shift +1), γ high (keep attention sharp).

Try different sequence lengths. With short sequences (4–5), the pattern is easy to see. With longer sequences (10–12), notice how the memory matrix fills more rows and the attention diagonals extend further. In the actual paper, the NTM trained on lengths up to 20 generalized to sequences of length 120, six times longer than anything seen in training: the algorithm scales because it is a genuine loop, not a fixed-size pattern.

What the output tells you. The output heatmap should closely match the input. Any discrepancy represents a "copy error." In the real NTM, trained to convergence, the copy is near-perfect. Our simulation adds slight noise to make the output visually distinguishable from the input, but the core pattern — sequential write, then sequential read — is faithful to the paper's Figure 6.

In the copy task visualization, why do the write and read head attention patterns both form diagonal lines?

Because the NTM learns to step through memory locations sequentially (location 0, then 1, then 2, etc.), creating a diagonal in the time-vs-location plot
Because the NTM uses content-based addressing to find matching vectors
Because the memory matrix is initialized as a diagonal matrix

Chapter 8: The Experiments

The paper tests five algorithmic tasks. Each is designed to probe a different capability of the NTM. All experiments compare three architectures: NTM with feedforward controller, NTM with LSTM controller, and a standalone LSTM network (no external memory).

Copy. Input a sequence of random 8-bit vectors (length 1–20), then output the same sequence. The NTM learned dramatically faster than LSTM and converged to near-zero error. More strikingly, an NTM trained on lengths up to 20 could copy sequences of length 100+ with no additional training. LSTM fell apart beyond length 20. The NTM had genuinely learned a copy algorithm, not just memorized training patterns.

By examining the head attention patterns, the authors reverse-engineered the learned program. In pseudocode:

# Write phase
move_head(start)
while input_delimiter not seen:
    write(input_vector, head_location)
    head_location += 1

# Read phase
move_head(start)
while True:
    output = read(head_location)
    emit(output)
    head_location += 1

This is precisely how a programmer would implement a copy in assembly language. The NTM discovered this algorithm from examples alone, with no human guidance about how to use the memory.
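For comparison, here is the same procedure as runnable Python with hard (integer) addressing in place of soft attention. It is an illustrative sketch of the algorithm the pseudocode describes, not the NTM itself: the function name `copy_with_memory` is hypothetical, and the read loop stops after the known sequence length rather than running forever.

```python
def copy_with_memory(inputs, num_locations=128):
    """Write each vector to consecutive rows, then read them back in order."""
    if len(inputs) > num_locations:
        raise ValueError("sequence does not fit in memory without wrapping")
    memory = [None] * num_locations

    # Write phase: one vector per location, advancing the head by 1 each step.
    head = 0
    for vector in inputs:
        memory[head] = list(vector)
        head = (head + 1) % num_locations

    # Read phase: return to the start and advance again.
    head = 0
    outputs = []
    for _ in range(len(inputs)):
        outputs.append(memory[head])
        head = (head + 1) % num_locations
    return outputs

sequence = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 1, 1, 1],
]
print(copy_with_memory(sequence) == sequence)  # True
```

Nothing in the loop body depends on the sequence length, which is the property that lets the learned version generalize beyond its training range.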

Repeat copy. Copy the input sequence a specified number of times. This task requires a form of nested loop: an outer loop over repetitions, and an inner loop over sequence positions. The NTM learned to use the external memory as a loop counter: write the sequence once, then re-read it the requested number of times, resetting the read head to the beginning after each pass. Again, it generalized to longer sequences and more repetitions than seen in training.

Associative recall. Given a list of (key, value) pairs followed by a query key, return the associated value. This tests content-based addressing directly — the NTM must store key-value bindings in memory, then use the query key to retrieve the corresponding value. The NTM excelled, achieving near-zero error. LSTM struggled because it had to compress the entire association table into its fixed-size hidden state, which becomes a bottleneck as the number of items grows. The NTM's error, by contrast, was independent of the number of stored items, up to the memory capacity.

Dynamic N-grams. Predict the next bit given the previous N bits, where the underlying distribution changes each sequence. This tests the NTM's ability to store and update statistical tables in memory — essentially maintaining a frequency count for each N-gram pattern. The NTM achieved performance close to the Bayesian optimal predictor, suggesting it learned to implement something akin to a counting algorithm in its memory operations.

Priority sort. Given a sequence of (value, priority) pairs, output the values sorted by priority. This is the most complex task and required multiple read/write heads (up to 8 for the feedforward controller, 5 for LSTM). The NTM learned a sorting algorithm, though generalization was less robust than for the simpler tasks. The feedforward controller needed 8 heads because, lacking internal recurrence, it could only perform unary vector operations per head per step — sorting requires comparing and swapping multiple items simultaneously.

Task	Key capability tested	NTM advantage over LSTM
Copy	Sequential write + read	Much faster convergence; generalizes to length 100+ after training on lengths up to 20
Repeat copy	Loop control	Learns loop counter in memory
Associative recall	Content-based lookup	Scales with memory, not hidden size
Dynamic N-grams	Online statistics	Near Bayesian-optimal prediction
Priority sort	Multi-head coordination	Faster convergence and lower error; LSTM-controller NTM uses 269,038 params vs 384,424 for LSTM

The hallmark of an algorithm. The strongest evidence that the NTM learned algorithms (rather than pattern-matching) is generalization beyond training range. A network trained on sequences of length 1–20 that correctly copies sequences of length 120 has clearly learned the procedure "write, advance, write, advance, ..., return, read, advance, read, advance, ..." rather than memorizing input-output pairs. This is qualitatively different from standard neural network generalization.

This generalization property deserves emphasis. Standard neural networks are notoriously poor at extrapolation — a network trained on inputs in [0, 10] rarely performs well on inputs at 50. The NTM's generalization comes from its structural inductive bias: the separation of memory and control means the same small program (the controller's weights) can operate on data of any size, as long as the memory is large enough. The algorithm does not change with the input length; only the number of steps increases. This is exactly the property that makes real computer programs general.

The NTM also used far fewer parameters than LSTM for comparable tasks. For the copy task: NTM feedforward used 17,162 parameters; LSTM used 1,352,969. That is roughly a 79× reduction. The external memory provides storage capacity without adding parameters to the network itself.

All experiments used the RMSProp optimizer with momentum 0.9, and gradient components were clipped elementwise to [-10, 10]. The memory size was 128 × 20 for all tasks. The LSTM baseline used 3 stacked hidden layers, with sizes ranging from 128 to 512 units depending on the task. The number of LSTM parameters grows quadratically with hidden size (due to recurrent connections), while NTM parameters do not grow with memory size — a fundamental structural advantage.
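The quadratic growth can be checked with a short count. The sketch below uses the textbook LSTM layer formula (four gates, no peephole connections) and a hypothetical input size of 9, so its totals are not expected to match the paper's reported figures exactly.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def stacked_lstm_params(input_size, hidden_size, layers=3):
    """Parameter count for a stack of equally sized LSTM layers."""
    total = 0
    size_in = input_size
    for _ in range(layers):
        total += lstm_layer_params(size_in, hidden_size)
        size_in = hidden_size  # each later layer reads the previous layer's output
    return total

for hidden in (128, 256, 512):
    print(hidden, stacked_lstm_params(input_size=9, hidden_size=hidden))
```

Doubling the hidden size roughly quadruples the count, because the recurrent term scales with the square of the hidden size. The number of memory rows never appears in the formula, which is the structural point the paragraph makes.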

One fascinating detail: during the copy task, the NTM's memory usage was bounded by the sequence length, not the total memory capacity. The cyclical shift mechanism meant that for sequences longer than 128 (the memory size), the head would wrap around and overwrite earlier writes. The network's generalization limit was literally the size of its RAM.
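The wrap-around follows from the shift being a circular convolution over memory rows. A minimal sketch, using 8 rows as a stand-in for 128 and a hypothetical helper name:

```python
def shift_weights(weights, shift_dist):
    """Circularly convolve attention weights with a shift distribution.

    shift_dist maps an integer offset (-1, 0, +1) to its probability.
    """
    n = len(weights)
    shifted = [0.0] * n
    for i, w in enumerate(weights):
        for offset, p in shift_dist.items():
            shifted[(i + offset) % n] += w * p  # the modulo turns memory into a ring
    return shifted

n_rows = 8                      # small stand-in for the 128 rows
weights = [0.0] * n_rows
weights[0] = 1.0                # head starts focused on row 0
step_forward = {-1: 0.0, 0: 0.0, 1: 1.0}

visited = []
for _ in range(10):
    visited.append(weights.index(max(weights)))
    weights = shift_weights(weights, step_forward)
print(visited)  # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```

After the eighth step the head is back at row 0, so a ninth write would land on top of the first one. With 128 rows, the same thing happens to any sequence longer than 128 items.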

The training setup itself was notable for its simplicity. All tasks used binary cross-entropy loss with logistic sigmoid outputs. The networks were trained from scratch on each task, with no pre-training or curriculum learning. The NTM's architectural inductive bias — the separation of memory and computation, the addressing mechanisms — was sufficient for the network to discover the right algorithms. No human engineering of the memory access patterns was required; the gradients found the solution.
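For reference, the loss described here can be written in a few lines. This is a generic sketch of sigmoid outputs with binary cross-entropy, plus the elementwise clipping mentioned earlier in this chapter; the function names and the example numbers are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss_and_grad(logits, targets):
    """Summed binary cross-entropy over output bits, and its gradient w.r.t. the logits."""
    loss, grad = 0.0, []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
        grad.append(p - t)  # derivative of the per-bit loss with respect to z
    return loss, grad

def clip_elementwise(grads, limit=10.0):
    """Clip each gradient component to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]

loss, grad = bce_loss_and_grad(logits=[2.0, -1.0, 0.0], targets=[1, 0, 1])
print(loss, grad)
print(clip_elementwise([-25.0, 3.0, 40.0]))  # a made-up parameter gradient
```

The gradient with respect to each output logit is simply the prediction minus the target. Clipping matters further back in the network, where backpropagation through many time steps can produce much larger components.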

The comparison with LSTM is particularly instructive on the priority sort task. The LSTM baseline used 3 hidden layers of 128 units each, totaling 384,424 parameters. The NTM with feedforward controller used 508,305 parameters (due to 8 heads), but the NTM with LSTM controller used only 269,038. In all cases, the NTM converged faster and to lower error. The external memory let a smaller controller solve a harder problem.

What is the strongest evidence that the NTM learned a copy algorithm rather than memorizing patterns?

It generalized to sequence lengths far beyond those seen during training: trained on lengths up to 20, it copied sequences of length 100+
It achieved lower training error than LSTM
It used fewer parameters than LSTM

Chapter 9: Connections

The Neural Turing Machine sits at the intersection of two deep ideas: augmenting neural networks with structured memory, and making discrete computation differentiable. Its influence rippled through the field in both directions.

Memory Networks (Weston et al., 2014). Published the same year by Jason Weston and colleagues at Facebook AI Research, Memory Networks also coupled a neural network to external memory. The key difference: Memory Networks used hard attention (selecting a single memory slot), which made memory access non-differentiable and required supervision of the access patterns. You had to tell the model which memory slot to read, rather than letting it learn. The NTM's soft attention avoided this because it trained end-to-end with no memory supervision. Sukhbaatar et al. (2015) later bridged this gap with End-to-End Memory Networks, which adopted soft attention like the NTM, a sign that the differentiable approach was the more practical one.

Differentiable Neural Computer (Graves et al., 2016). The direct successor. The DNC replaced the NTM's convolutional shift with more sophisticated addressing: temporal link matrices to track write order, a usage vector to find free memory locations, and allocation-based writing. The DNC generalized better on graph traversal and puzzle tasks but was significantly more complex. The NTM's simplicity remains instructive.

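Attention mechanisms and Transformers. The NTM's content-based addressing (computing cosine similarity between a query and memory, then taking a softmax-weighted sum) has the same structure as the attention mechanism in Bahdanau et al. (2014) and the Transformer (Vaswani et al., 2017): score the stored vectors against a query, normalize the scores with a softmax, and return the weighted sum. Only the scoring function differs: cosine similarity in the NTM, a small learned scoring network in Bahdanau et al., and a scaled dot product in the Transformer. In the Transformer, queries attend to keys and retrieve values from the same sequence. In the NTM, the key attends to memory rows and retrieves their contents. The mechanism is the same; the memory source is different.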

In fact, you can view a Transformer layer as an NTM where the "memory" is the set of key-value vectors from other positions in the sequence, and the "controller" is the query computation. The Transformer removed the sequential controller (replacing it with parallel self-attention) and eliminated location-based addressing (relying on positional encodings instead). These simplifications made the architecture parallelizable and scalable, but the core memory-access principle originated here.
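The parallel is easiest to see with the two read operations side by side. This is a minimal pure-Python sketch: the function names are illustrative, the `sharpness` argument plays the role of the NTM's key strength, and the learned projections that a Transformer applies to queries, keys, and values are left out.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-8)

def weighted_sum(weights, rows):
    return [sum(w * row[j] for w, row in zip(weights, rows)) for j in range(len(rows[0]))]

def ntm_content_read(key, memory, sharpness=1.0):
    """NTM-style read: cosine similarity, softmax, weighted sum of memory rows."""
    weights = softmax([sharpness * cosine(key, row) for row in memory])
    return weights, weighted_sum(weights, memory)

def dot_product_attention(query, keys, values):
    """Transformer-style read: scaled dot product, softmax, weighted sum of values."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return weights, weighted_sum(weights, values)

memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
key = [0.9, 0.1, 0.0]
print(ntm_content_read(key, memory, sharpness=5.0)[0])
print(dot_product_attention(key, memory, memory)[0])
```

Both functions put most of their weight on the first row. The two differ only in the scoring line and in whether the retrieved vectors are the same ones that were scored.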

Modern retrieval-augmented generation (RAG). Today's large language models use retrieval-augmented generation: look up relevant documents from an external database, then condition generation on the retrieved text. This is the NTM's philosophy at scale — separate what the model knows how to do (encoded in weights) from what it knows about (stored externally). The model does not try to memorize every fact in its parameters; it stores knowledge externally and retrieves it when needed. The NTM was arguably the first differentiable implementation of this principle, though at a much smaller scale than modern RAG systems.

Working memory in neuroscience. The paper explicitly draws parallels to working memory in cognitive science — the prefrontal cortex as controller, short-term memory buffers as the memory matrix. The NTM's learned read/write policies mirror how psychologists describe working memory: a "central executive" that directs attention to manipulate stored information. The paper cites psychologist George Miller's famous "magical number seven" — human working memory holds about 7 ± 2 chunks. The NTM is not constrained by this biological limit; its memory can be arbitrarily large.

Neural program induction vs. synthesis. The NTM occupies an interesting middle ground. It does not synthesize a program as explicit code; instead, it induces program-like behavior through its learned controller weights. The controller is a fixed neural network that, when run step-by-step with access to external memory, executes a procedure. Later work like Neural Programmer-Interpreters (Reed & de Freitas, 2015) and Neuro-Symbolic Program Synthesis (Parisotto et al., 2016) pushed further toward explicit program generation, but the NTM showed that implicit program learning via memory manipulation is surprisingly powerful.

Limitations. The NTM has real shortcomings. Training is unstable: the attention mechanisms can saturate or oscillate, and the authors used gradient clipping (components clipped to [-10, 10]) to keep training stable. The convolutional shift is limited to small offsets ({-1, 0, +1}), making random access to distant locations slow. And the memory has a fixed size N, set before training; the network cannot dynamically allocate new memory. The DNC targeted the addressing and allocation problems with temporal links and usage-based allocation, but at the cost of significantly more complexity.

NTM (2014): External memory + soft attention + differentiable read/write
↓ extended by
DNC (2016): Temporal links, dynamic allocation, usage tracking
↓ generalized into
Transformers (2017) & RAG (2020+): Attention as memory access, external knowledge retrieval at scale

The lasting contribution. The NTM demonstrated that you can give a neural network access to a structured, addressable memory — and it will learn to use it. Not through hand-coded rules, but through gradient descent. The specific architecture has been superseded, but the principle — augment networks with external memory, make everything differentiable — is now fundamental to modern AI. Every time a Transformer attends to a key-value store, it is channeling the NTM.

Fodor and Pylyshyn's challenge. The paper opens with an underappreciated context: the 1988 critique by Fodor and Pylyshyn, who argued that neural networks cannot do variable binding — assigning arbitrary data to arbitrary slots. In language, when you hear "Mary spoke to John," you bind "Mary" to the subject role and "John" to the object role. Fodor and Pylyshyn claimed connectionist systems could not do this. The NTM is a direct answer: its write operation binds data (the add vector) to a slot (a memory location). The NTM does not just refute the critique theoretically — it demonstrates variable binding empirically, learning it from examples.
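Binding in this sense can be shown with the erase-then-add write on a tiny memory. The sketch below uses one-hot attention weights and a made-up two-dimensional encoding of "Mary" and "John"; in the actual NTM both the weights and the vectors are soft and learned.

```python
def write(memory, weights, erase, add):
    """NTM-style write: erase, then add, both gated by the attention weights."""
    return [
        [m * (1.0 - w * e) + w * a for m, e, a in zip(row, erase, add)]
        for row, w in zip(memory, weights)
    ]

def read(memory, weights):
    """Attention-weighted sum of memory rows."""
    width = len(memory[0])
    return [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]

# Hypothetical toy encoding: two fillers and two role slots.
MARY, JOHN = [1.0, 0.0], [0.0, 1.0]
SUBJECT_SLOT = [1.0, 0.0, 0.0]  # attention focused on row 0
OBJECT_SLOT = [0.0, 1.0, 0.0]   # attention focused on row 1

memory = [[0.0, 0.0] for _ in range(3)]
erase_all = [1.0, 1.0]
memory = write(memory, SUBJECT_SLOT, erase_all, MARY)  # bind Mary to the subject role
memory = write(memory, OBJECT_SLOT, erase_all, JOHN)   # bind John to the object role

print(read(memory, SUBJECT_SLOT))  # [1.0, 0.0]
print(read(memory, OBJECT_SLOT))   # [0.0, 1.0]
```

Swapping the two add vectors swaps the bindings without touching any other part of the procedure, which is what "arbitrary data in arbitrary slots" asks for.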

Paper details. "Neural Turing Machines," Alex Graves, Greg Wayne, Ivo Danihelka. Google DeepMind, 2014. arXiv:1410.5401. One of Ilya Sutskever's "30 under 30" papers for understanding modern deep learning.

How does the NTM's content-based addressing relate to the attention mechanism in Transformers?

They are structurally the same: both compute similarity between a query and stored vectors, then take a softmax-weighted sum to retrieve information
They are unrelated: Transformers use dot-product attention, while NTMs use cosine similarity
The NTM uses hard attention, while Transformers use soft attention
