The encoder squashes an entire sentence into one vector. What if the decoder could look back? This paper introduced attention — the mechanism that would reshape all of deep learning.
Imagine you are a simultaneous interpreter. Someone speaks an entire paragraph in French, and you must wait until they finish, hold everything in your head, then produce the English translation. For a short sentence — fine. For a long, complex paragraph — you start forgetting the beginning by the time you reach the end.
This is exactly the problem that neural machine translation faced in 2014. The field was barely a year old: Kalchbrenner and Blunsom (2013) and Sutskever et al. (2014) had pioneered the idea of using neural networks to translate directly, without phrase tables or hand-crafted features. The dominant approach, the encoder-decoder architecture (Cho et al., 2014; Sutskever et al., 2014), worked like this: an RNN reads the entire source sentence one word at a time, compresses it into a single fixed-length vector, and a second RNN generates the translation from that vector alone.
The fixed-length vector is the bottleneck. Every bit of meaning — the subject, the verb tense, the negation, the subordinate clause — must be packed into one vector of, say, 1000 dimensions. For a 5-word sentence, that is plenty. For a 50-word sentence, something has to give.
Cho et al. (2014b) had already shown the evidence: BLEU scores of the basic encoder-decoder plummeted once source sentences exceeded 20 words. The information simply could not fit through the bottleneck.
Bahdanau, Cho, and Bengio proposed a strikingly elegant solution. Instead of compressing the source into one vector, keep all the encoder hidden states around. At each decoding step, let the decoder learn which source positions to attend to. This is attention — and this paper is where it was born for sequence-to-sequence models.
The key word in the title is "jointly." The model does not first learn to align and then learn to translate. It learns both simultaneously. The alignment is not a separate module with its own supervision — it emerges naturally from the translation objective. This was a radical departure from traditional statistical machine translation, where alignment was a complex preprocessing step (think IBM Models 1-5).
Before we fix the problem, let's make sure we understand the machine we are fixing. The encoder-decoder has two RNNs that work in sequence.
The encoder reads the source sentence x = (x1, x2, ..., xT) one token at a time. At each step, it updates a hidden state:
$$h_t = f(x_t, h_{t-1})$$
where f is a nonlinear function (an RNN cell such as a vanilla RNN, GRU, or LSTM). After processing the last word, the encoder produces a context vector c, typically just the final hidden state:
$$c = h_T$$
This single vector c is supposed to capture the meaning of the entire source sentence.
The decoder generates the target sentence one word at a time. At each step t, it produces a probability distribution over the vocabulary, conditioned on all previously generated words and the context vector:
$$p(y_t \mid y_1, \ldots, y_{t-1}, c) = g(y_{t-1}, s_t, c)$$
where st is the decoder's hidden state and g is a function that outputs a probability. The critical detail: c is the same for every decoding step. Whether the decoder is producing the first word or the thirtieth, it works from the same frozen summary of the source. The decoder has no way to "go back and check" a specific part of the source — it must extract everything it needs from this single vector.
How much information can one vector hold? With 1000-dimensional hidden states, the context vector is a point in a 1000-dimensional space. Empirically, fixed-vector models of this size translated short sentences well but degraded once sentences grew past roughly 20-30 words. This is not a theoretical limit. It is an empirical observation reported by Cho et al. (2014b), and it directly motivated this paper.
Sutskever et al. (2014) found a clever trick — reversing the source sentence so that the first source words are closer to the decoder in the computational graph. This helped, but did not solve the fundamental problem. The bottleneck remained.
Let's make this concrete with a tiny example. Suppose we are translating the French sentence "Le chat est sur le tapis" into English "The cat is on the mat." The encoder processes "Le," updates h1. Processes "chat," updates h2. And so on through h6. Then c = h6 is passed to the decoder.
When the decoder tries to produce "The," it needs information about "Le." But "Le" was processed five steps before the final state was formed, and its trace has since been diluted by "chat," "est," "sur," "le," and "tapis." The decoder is working from a summary that is biased toward the end of the sentence. For a six-word sentence this is tolerable. For a sixty-word sentence from a legal document, it is devastating.
Here is Bahdanau's insight: instead of using one fixed context vector, let the decoder compute a different context vector at each step. And instead of relying only on the final encoder state, keep all the encoder hidden states and let the decoder decide which ones matter right now.
But how does the decoder know which source positions are relevant for the current target word? It learns an alignment model — a small neural network that scores how well each source position matches the current decoder state.
At decoding step i, the model computes an alignment score between the decoder's previous hidden state si−1 and each encoder hidden state hj:
$$e_{ij} = a(s_{i-1}, h_j)$$
The function a is a small feedforward neural network with one hidden layer:
$$a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$$
Let's unpack this piece by piece. Wa is a weight matrix that projects the decoder state si−1 into an alignment space of dimension n' (set to 1000 in the paper). Ua is a separate weight matrix that projects the encoder annotation hj into the same space. The two projections are added together and passed through tanh (bounding the result between −1 and +1). Finally, va is a learned weight vector that collapses this n'-dimensional representation into a single scalar score.
This design is called additive attention because the query and key projections are added (not multiplied). Later work by Luong et al. (2015) would show that multiplicative (dot-product) attention — where you simply take the dot product of query and key — works comparably and is faster. But additive attention has an advantage: the tanh bounds the intermediate representation, avoiding the magnitude scaling issues that plague raw dot products in high dimensions.
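The additive score can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: the helper names (`matvec`, `additive_score`), the toy dimensions, and every weight value are assumptions made up for the example.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def additive_score(s_prev, h_j, W_a, U_a, v_a):
    """e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), returned as a float."""
    proj_s = matvec(W_a, s_prev)   # decoder state -> alignment space
    proj_h = matvec(U_a, h_j)      # encoder annotation -> alignment space
    hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
    return sum(v * t for v, t in zip(v_a, hidden))

# Hypothetical toy sizes: 2-dim decoder state, 3-dim annotations, 2-dim alignment space.
W_a = [[0.5, -0.1], [0.2, 0.3]]
U_a = [[0.1, 0.0, -0.2], [0.4, 0.1, 0.0]]
v_a = [1.0, -1.0]
s_prev = [0.3, -0.7]
annotations = [[1.0, 0.0, 0.5], [-0.5, 2.0, 0.1]]

# One scalar score per source position.
scores = [additive_score(s_prev, h_j, W_a, U_a, v_a) for h_j in annotations]
print(scores)
```

Because every tanh output lies in (−1, 1), each score is bounded in magnitude by the sum of the absolute values of the entries of `v_a` (here 2.0), whatever the scale of the inputs.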
The scores are then normalized through a softmax to produce attention weights:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$
Each αij is between 0 and 1, and they sum to 1 across all source positions. You can interpret αij as the probability that target word yi is aligned to source word xj.
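Let's trace through a tiny numerical example to make this concrete. Suppose we are at decoding step 3 (generating the third English word), and the source sentence has 4 words. The alignment model computes four scores, e3,1 through e3,4, one per source position.
After softmax normalization, these become four weights α3,1 through α3,4 that sum to 1. Suppose position 2 received by far the highest score, so that α3,2 ≈ 0.93.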
The decoder is overwhelmingly focused on source position 2. The context vector c3 will be almost entirely h2 — the annotation of the second source word. This is a sharp alignment. In practice, alignments can also be diffuse, spreading weight across multiple positions — useful when a single target word corresponds to a multi-word phrase in the source.
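To see how a sharp alignment like this arises, here is a small softmax sketch. The four score values are hypothetical, chosen only so that position 2 ends up with roughly 0.93 of the weight; they are not taken from the paper.

```python
import math

def softmax(scores):
    """Turn raw alignment scores into weights that are positive and sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]

# Hypothetical scores for decoding step 3 over 4 source positions.
e3 = [0.8, 4.0, 0.2, -0.5]
alpha3 = softmax(e3)
print([round(a, 3) for a in alpha3])
```

A gap of about three units between the top score and the rest is already enough for the softmax to concentrate almost all of the weight on one position. Closer scores would give a more diffuse alignment.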
Now we have attention weights αij that tell us how much each source position matters for the current decoding step. The next step is beautifully simple: compute a weighted sum of the encoder hidden states.
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$
This is the context vector for decoding step i. Unlike the basic encoder-decoder where c was the same at every step, here ci is different at every step. When generating the subject of the English sentence, ci focuses on the subject of the French sentence. When generating a verb, ci shifts to attend to the French verb.
Think of ci as an expected annotation. If we treat the attention weights as a probability distribution over source positions, where αij is the probability that target word yi is aligned to source word xj, then ci is the expected value of the encoder annotations under that distribution.
Continuing our numerical example: if α3,2 = 0.93, the other three weights share the remaining 0.07, so c3 is 0.93 · h2 plus small contributions from h1, h3, and h4. It is almost entirely h2. The decoder effectively "reads" the second source word while generating the third target word.
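The weighted sum itself is easy to check by hand. The sketch below uses hypothetical weights and tiny 3-dimensional annotations (the real ones are 2000-dimensional) so that the contribution of each position is visible in the result.

```python
def context_vector(alpha, annotations):
    """c_i = sum_j alpha_ij * h_j, computed component by component."""
    dim = len(annotations[0])
    return [sum(a * h[d] for a, h in zip(alpha, annotations)) for d in range(dim)]

# Hypothetical attention weights (they sum to 1) for 4 source positions.
alpha3 = [0.04, 0.93, 0.02, 0.01]
# Hypothetical 3-dim annotations, one row per source position.
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]

c3 = context_vector(alpha3, H)
print([round(x, 2) for x in c3])
```

The second component dominates because nearly all of the weight sits on the second annotation, which is the only one (besides the lightly weighted fourth) that is nonzero there.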
The decoder then uses this step-specific context vector to compute its hidden state and predict the next word:
$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$
$$p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$$
Compare this to the basic encoder-decoder's equation p(yi | ...) = g(yi−1, si, c). The only difference is that c has become ci — but this small change is everything. The decoder now has a dynamic window into the source sentence that shifts at every step.
An analogy: imagine reading a book in a dark room with a flashlight. The basic encoder-decoder glances at the whole page once, turns the light off, then tries to recite the content from memory. The attention model keeps the page illuminated and, at each word of the recitation, moves the flashlight to the relevant passage. The same information is available; the difference is access pattern.
There is an important computational note. Computing αij requires evaluating the alignment model for every source position j at every decoding step i. For a source sentence of length Tx and target of length Ty, this is O(Tx × Ty) alignment computations. This quadratic cost was acceptable for the sentence lengths in machine translation (rarely above 100 words), but it foreshadowed a challenge that the Transformer would inherit — and that later work on efficient attention would try to address.
There is a second innovation in this paper that is easy to overlook. A standard RNN reads the sentence left to right. When it reaches word xj, its hidden state hj summarizes everything from x1 to xj — the past. It knows nothing about xj+1 through xT — the future.
But when attending to a source word, the decoder might need context from both directions. Consider the sentence "the cat that I saw yesterday sat on the mat." When the decoder attends to "cat," it helps to know both that "the" precedes it (the cat is definite) and that "sat" follows it several words later (telling us "cat" is the subject of "sat," not of "saw"). A left-to-right encoder at position "cat" would know about "the" but nothing about "sat." The decoder needs bidirectional context to make the right choices.
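Bahdanau uses a bidirectional RNN (BiRNN). Two separate RNNs process the source sentence. The forward RNN reads x1 through xT in order and produces forward states h⃗1, ..., h⃗T. The backward RNN reads the sentence in reverse, from xT back to x1, and produces backward states h⃖1, ..., h⃖T. The annotation for word xj is their concatenation:
$$h_j = \left[\overrightarrow{h}_j \,;\, \overleftarrow{h}_j\right]$$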
By concatenating the forward and backward states, each annotation hj captures context from both directions. If each RNN has 1000 hidden units, the annotation is a 2000-dimensional vector.
Why does the BiRNN help attention specifically? Because the attention mechanism needs each hj to be a useful summary of "what is happening around position j." A unidirectional state hj only knows the left context. A bidirectional state knows both sides, making it a much richer representation for the alignment model to score against.
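The difference between the two directions can be made concrete with a toy encoder. This sketch swaps the paper's GRU for a scalar tanh RNN to stay short; the function names and all weights are hypothetical.

```python
import math

def rnn_states(xs, w_x, w_h):
    """Run a scalar tanh RNN over xs and return the state after each input."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def birnn_annotations(xs, fwd=(0.8, 0.5), bwd=(0.6, 0.7)):
    """Annotation h_j = (forward state at j, backward state at j)."""
    forward = rnn_states(xs, *fwd)
    # The backward RNN reads the reversed sequence; flip its states back to source order.
    backward = rnn_states(xs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))

xs = [0.2, -1.0, 0.5]                             # stand-ins for three word embeddings
ann = birnn_annotations(xs)
changed = birnn_annotations([0.2, -1.0, 0.9])     # same sentence, different last word

# The forward half of the first annotation cannot see the last word; the backward half can.
print(ann[0][0] == changed[0][0], ann[0][1] == changed[0][1])
```

Changing only the last input leaves the forward state at position 1 untouched but alters the backward state there, which is exactly the right-hand context a unidirectional encoder would miss.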
The paper uses GRU (Gated Recurrent Unit) cells for both the forward and backward RNNs. A GRU has two gates: an update gate z that controls how much of the old state to keep, and a reset gate r that controls how much of the old state to expose when computing the candidate new state. These gates address the vanishing gradient problem — the update gate can learn to pass information unchanged across many timesteps, like a highway for gradients.
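A scalar GRU step shows the gating behaviour directly. This is a sketch under simplifying assumptions: one-dimensional states, no bias terms, and hypothetical weights stored in a dict. It follows the convention s = (1 − z) · s_prev + z · s_tilde used in this article's decoding-step code, so an update gate near 0 keeps the old state and a gate near 1 overwrites it.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One scalar GRU step; p is a dict of hypothetical weights."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev)             # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev)             # reset gate
    h_tilde = math.tanh(p["w"] * x + p["u"] * (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

base = {"u_z": 0.0, "w_r": 1.0, "u_r": 0.0, "w": 1.0, "u": 1.0}
keep = dict(base, w_z=-20.0)    # update gate saturates near 0: copy the old state
write = dict(base, w_z=20.0)    # update gate saturates near 1: take the candidate

kept = gru_step(1.0, 0.5, keep)
written = gru_step(1.0, 0.5, write)
print(kept, written)
```

With the gate closed, the state passes through essentially unchanged, which is the "highway" that lets information and gradients survive many timesteps.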
The word embedding matrix E (dimension 620) is shared between the forward and backward RNNs — both directions use the same learned word representations. But the recurrent weight matrices (W, U and their gated variants) are separate, allowing each direction to learn different temporal patterns. The forward RNN might learn that a determiner signals an upcoming noun, while the backward RNN might learn that a period signals the end of a clause — complementary information captured by different parameters.
Let's put all the pieces together. The model the paper calls RNNsearch (because the decoder "searches" the source) has three components wired together end-to-end.
| Component | What It Does | Details |
|---|---|---|
| BiRNN Encoder | Reads source sentence in both directions | 2 × 1000 GRU units → 2000-dim annotations |
| Alignment Model | Scores source positions for each decoder step | Feedforward net: eij = vaT tanh(Wa si−1 + Ua hj) |
| GRU Decoder | Generates target words one at a time | 1000 GRU units, conditioned on ci, yi−1, si−1 |
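Here is the complete computation flow for generating one target word yi:
Step 1 (once per sentence): run the BiRNN encoder and store the annotations h1, ..., hTx.
Step 2: score each annotation against the previous decoder state, eij = a(si−1, hj), and normalize the scores with a softmax to get the weights αij.
Step 3: form the context vector ci as the αij-weighted sum of the annotations.
Step 4: update the decoder state si from si−1, yi−1, and ci, then predict yi.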
Steps 2-4 repeat for every target word. The encoder runs only once — its annotations are stored and reused. At each step, the attention weights shift to focus on different source positions, giving the decoder a fresh, step-specific view of the source sentence.
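The encode-once, attend-at-every-step pattern can be sketched as a toy loop. Everything here is a simplification for illustration: states and annotations are scalars, a plain tanh update stands in for the GRU, word prediction is omitted, and all weights and annotation values are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attend(s_prev, annotations, w_a, u_a, v_a):
    """Score every annotation against s_prev, normalize, and take the weighted sum."""
    scores = [v_a * math.tanh(w_a * s_prev + u_a * h) for h in annotations]
    alpha = softmax(scores)
    context = sum(a * h for a, h in zip(alpha, annotations))
    return alpha, context

def decode(annotations, steps, s0=0.0):
    """Toy decoder loop: the annotations are fixed, the attention is recomputed each step."""
    s, history = s0, []
    for _ in range(steps):
        alpha, c = attend(s, annotations, w_a=1.0, u_a=2.0, v_a=3.0)
        s = math.tanh(0.5 * s + 1.5 * c)   # simplified state update using the context
        history.append((alpha, c, s))
    return history

annotations = [-0.9, 0.1, 0.8]   # computed once by the encoder (stand-in values)
history = decode(annotations, steps=3)
for alpha, c, s in history:
    print([round(a, 3) for a in alpha], round(c, 3), round(s, 3))
```

The annotations never change inside the loop, but the weights do, because each step scores them against a different decoder state.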
Notice the asymmetry: the encoder processes the source in O(Tx) time (one pass of the BiRNN), while the decoder takes O(Ty × Tx) time because it must compute Tx alignment scores at each of the Ty decoding steps. In practice this is not a bottleneck. The projection Ua hj does not depend on the decoding step, so it can be computed once per sentence; each score then costs only a vector addition, a tanh, and a dot product with va.
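One way to keep the per-step cost low is to cache the encoder-side projection, since Ua hj is the same at every decoding step. The sketch below compares a naive scorer with a cached one; function names, sizes, and weights are all hypothetical.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def scores_naive(s_prev, H, W_a, U_a, v_a):
    """Recompute both projections for every source position."""
    out = []
    for h_j in H:
        pre = [a + b for a, b in zip(matvec(W_a, s_prev), matvec(U_a, h_j))]
        out.append(sum(v * math.tanh(p) for v, p in zip(v_a, pre)))
    return out

def scores_cached(s_prev, UH, W_a, v_a):
    """Reuse the precomputed U_a h_j; project the decoder state once per step."""
    ws = matvec(W_a, s_prev)
    return [sum(v * math.tanh(a + b) for v, a, b in zip(v_a, ws, uh)) for uh in UH]

W_a = [[0.5, -0.1], [0.2, 0.3]]
U_a = [[0.1, 0.0, -0.2], [0.4, 0.1, 0.0]]
v_a = [1.0, -1.0]
H = [[1.0, 0.0, 0.5], [-0.5, 2.0, 0.1]]

UH = [matvec(U_a, h_j) for h_j in H]        # once per sentence
decoder_states = [[0.3, -0.7], [0.1, 0.4]]  # two decoding steps
pairs = [(scores_naive(s, H, W_a, U_a, v_a), scores_cached(s, UH, W_a, v_a))
         for s in decoder_states]
print(pairs)
```

Both versions produce the same scores; the cached one simply moves the Ua multiplications out of the decoding loop.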
The output layer uses a maxout network (Goodfellow et al., 2013) — a feedforward layer that takes the max of pairs of linear features before projecting to the vocabulary. This gives the output a piecewise-linear function approximation capability.
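The pairwise max at the heart of that layer is tiny. A minimal sketch, with a hypothetical function name and made-up feature values:

```python
def maxout_pairs(features):
    """Take the max of each adjacent pair of features: length 2l -> length l."""
    if len(features) % 2 != 0:
        raise ValueError("expected an even number of features")
    return [max(features[k], features[k + 1]) for k in range(0, len(features), 2)]

t = maxout_pairs([0.3, -1.2, 2.0, 2.5, -0.1, -0.4])
print(t)  # [0.3, 2.5, -0.1]
```

Each output is the larger of two linear features, so the layer as a whole computes a piecewise-linear function of its input.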
The decoder's initial hidden state s0 is initialized by passing the backward RNN's first hidden state through a learned transformation:
$$s_0 = \tanh\left(W_s \overleftarrow{h}_1\right)$$
Since the backward RNN reads right to left, h⃖1 is its final state, produced after it has read the entire source sentence. This gives the decoder a summary of the whole source to start from, similar to how the basic encoder-decoder uses a final hidden state, but here it only initializes the decoder rather than being the sole source of information. Why the backward state and not the forward? The forward RNN's last state has also seen every word, so coverage is not the difference. One plausible reading is that h⃖1 was updated most recently on the first source words, which are often what the decoder needs first.
The model sizes are modest by modern standards: 1000 hidden units per RNN direction, 620-dimensional word embeddings, a vocabulary of 30,000 words per language, and an alignment model with 1000 hidden units. The total parameter count is in the tens of millions, much of it in the embedding and output matrices (a single 30,000 × 620 embedding table already holds 18.6 million weights). That is a fraction of what modern transformers use, yet the architectural idea proved far more lasting than any particular scale.
Let's trace the dimensions through a concrete example. Source sentence "Le chat" (2 words):
| Quantity | Shape | Description |
|---|---|---|
| x1, x2 | (620,) | Word embeddings for "Le" and "chat" |
| h⃗1, h⃗2 | (1000,) | Forward GRU hidden states |
| h⃖1, h⃖2 | (1000,) | Backward GRU hidden states |
| h1, h2 | (2000,) | Concatenated annotations [h⃗ ; h⃖] |
| ei,1, ei,2 | scalar | Alignment scores at step i |
| αi,1, αi,2 | scalar | Attention weights (sum to 1) |
| ci | (2000,) | Context vector — weighted sum of annotations |
| si | (1000,) | Decoder GRU hidden state |
Notice that the context vector ci has dimension 2000 (because annotations are 2000-dimensional), but the decoder hidden state is only 1000-dimensional. The GRU decoder must therefore have weight matrices C, Cz, Cr of shape (1000 × 2000) to incorporate the context vector into its update equations. This is where the attention information enters the decoder's computation.
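Here is a NumPy sketch of one decoding step, showing how all the pieces fit together. The helper functions and the dict `p` of weight arrays are illustrative names, not the paper's code:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

def maxout(x):
    return x.reshape(-1, 2).max(axis=1)  # max over adjacent pairs: (2l,) -> (l,)

def decode_step(h, s_prev, y_emb, p):
    # Given: annotations h (Tx, 2n), previous state s_prev (n,),
    # embedding of the previous word y_emb (m,), dict p of weight arrays
    # Step 1: Compute alignment scores, one per source position
    e = np.array([p["v_a"] @ np.tanh(p["W_a"] @ s_prev + p["U_a"] @ h_j) for h_j in h])
    # Step 2: Softmax to get attention weights; shape (Tx,), sums to 1
    alpha = softmax(e)
    # Step 3: Weighted sum of annotations; shape (2n,)
    c = alpha @ h
    # Step 4: GRU decoder update (with context)
    z = sigmoid(p["W_z"] @ y_emb + p["U_z"] @ s_prev + p["C_z"] @ c)
    r = sigmoid(p["W_r"] @ y_emb + p["U_r"] @ s_prev + p["C_r"] @ c)
    s_tilde = np.tanh(p["W"] @ y_emb + p["U"] @ (r * s_prev) + p["C"] @ c)
    s = (1 - z) * s_prev + z * s_tilde
    # Step 5: Output distribution via maxout + softmax (reads s_prev, as in the original)
    t = maxout(p["U_o"] @ s_prev + p["V_o"] @ y_emb + p["C_o"] @ c)
    p_y = softmax(p["W_o"] @ t)  # probability over vocabulary
    return s, alpha, p_y
```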
The paper's most striking result is not a number — it is a picture. Figure 3 shows attention weights as a heatmap: source words on one axis, target words on the other, brightness indicating attention strength. The patterns are remarkably intuitive: the model learns to "look at" the right source word when generating each target word.
The key finding is that the model learns roughly monotonic alignment for language pairs with similar word order, with interesting deviations for reordering (like adjective-noun order in French vs. English).
An attention heatmap reads like this: each row is a target (English) word, each column is a source (French) word. Bright cells mean high attention weight: the decoder is "looking at" that source word. Dark cells mean the decoder is ignoring that position. Read row by row, the bright cells trace a path through the source sentence.
Take a sentence beginning "Le chat noir..." and look at the words "black" and "cat." In French, "noir" (black) comes after "chat" (cat), but in English, "black" comes before "cat." The attention pattern crosses over at that point: the model handles this reordering without any explicit rule about adjective placement.
The paper evaluates on English-to-French translation using the WMT '14 dataset (348M words of parallel corpora after filtering). Two models are compared: RNNencdec (the basic encoder-decoder) and RNNsearch (the proposed attention model). Each is trained on sentences up to length 30 and 50.
| Model | Max Length | BLEU (all) | BLEU (no UNK) |
|---|---|---|---|
| RNNencdec-30 | 30 | 13.93 | 24.19 |
| RNNencdec-50 | 50 | 17.82 | 26.71 |
| RNNsearch-30 | 30 | 21.50 | 31.44 |
| RNNsearch-50 | 50 | 26.75 | 34.16 |
| RNNsearch-50* | 50 | 28.45 | 36.15 |
| Moses (phrase-based) | n/a | 33.30 | 35.63 |
Several things stand out from these results:
Attention closes the gap. RNNsearch-50 (BLEU 26.75) outperforms RNNencdec-50 (17.82) by almost 9 BLEU points, and RNNsearch-50*, the same model trained longer until development-set performance stopped improving, reaches 28.45. On sentences with no unknown words, RNNsearch-50* reaches 36.15, comparable to the phrase-based Moses system (35.63).
Long sentences no longer collapse. The paper's Figure 2 is the most telling result. For RNNencdec, BLEU drops steeply as sentences grow past roughly 20 words. For RNNsearch-50, performance remains stable even at length 50 and beyond. The attention mechanism has removed the fixed-length bottleneck.
RNNsearch-30 beats RNNencdec-50. Training on longer sentences helps both architectures (RNNencdec goes from 13.93 to 17.82, RNNsearch from 21.50 to 26.75), but the attention model trained only on sentences up to 30 words still outperforms the basic encoder-decoder trained on sentences up to 50. Longer training sentences cannot compensate for the bottleneck: the basic model is still asked to compress every sentence into the same fixed vector. The attention mechanism scales gracefully because the number of annotations grows with sentence length, and the decoder can access all of them.
Single model vs. pipeline. A remarkable aspect of these results is that RNNsearch is a single end-to-end neural network. Moses, the phrase-based system it competes with, is a complex pipeline of separately trained components: word alignments, phrase extraction, language model, reordering model, and feature combination. Yet a single neural model with attention approaches its quality. This simplification — one model replacing an entire pipeline — would become a recurring theme in deep learning's subsequent takeover of NLP.
Training details: both models use GRU cells with 1000 hidden units, minibatch SGD with Adadelta, batch size 80, trained for approximately 5 days on a single GPU. Beam search with width 12 is used at inference. The vocabulary is limited to the 30,000 most frequent words in each language, with unknown words mapped to a special [UNK] token.
Qualitative analysis. The paper also examines the learned alignments qualitatively. For simple sentences with similar word order in English and French, the attention weights form a near-perfect diagonal — word 1 maps to word 1, word 2 to word 2, and so on. But for sentences with reordering (like adjective-noun order), the attention pattern deviates from the diagonal in linguistically sensible ways. The model has learned grammar without being taught it.
The paper also shows full translations of long sentences (30+ words) comparing RNNencdec and RNNsearch. The RNNencdec translations often lose the thread midway — substituting incorrect words or producing ungrammatical output. For example, on a 36-word sentence about a doctor's admitting privilege, RNNencdec produced "to recognize a patient at the hospital" instead of "to admit a patient to a hospital." The meaning was mangled. RNNsearch produced the correct translation, nearly matching the reference.
This qualitative evidence was almost as persuasive as the BLEU numbers. Researchers could read the translations and see that the attention model maintained coherence over long distances, while the basic model degraded into paraphrase-like approximations that missed key details.
This paper introduced a mechanism — attention — that would grow far beyond machine translation. Let's trace its impact.
It reframed the problem. Before Bahdanau, the encoder-decoder was a compression problem: squeeze the source into one vector. After Bahdanau, it became a retrieval problem: store all source information and selectively access it. This shift in perspective was profound. The encoder went from being a compressor to being a memory writer, and the decoder went from being a generator to being a memory reader. This memory-access framing would later evolve into the Query-Key-Value formulation of the Transformer.
It made alignment differentiable. Traditional machine translation relied on separate alignment models (IBM Models 1-5, HMM alignment) trained with EM. These were hard, discrete alignments — each target word was linked to exactly one source position, and this assignment was a latent variable optimized through expectation-maximization. Bahdanau's soft attention made alignment a continuous, differentiable operation trainable with standard backpropagation. The alignment weights αij are just softmax outputs — fully differentiable. Gradients flow through them, the alignment model improves, and no separate training stage is needed. This eliminated an entire pipeline stage.
It inspired the Transformer. Vaswani et al. (2017) replaced the additive attention of this paper with scaled dot-product attention and asked: what if attention were the only mechanism? No RNN, no convolution — just attention. The Transformer was born, and with it, GPT, BERT, and the entire modern era of large language models. In a very real sense, the Transformer is what you get when you take Bahdanau attention, make it faster (dot-product instead of additive), give it multiple heads, apply it within a sequence (self-attention, not just cross-attention), and remove the RNN entirely.
| Evolution | Architecture | How Attention Is Used |
|---|---|---|
| Before Bahdanau | Encoder-decoder | No attention — fixed context vector |
| This paper (2014) | BiRNN + attention | Cross-attention: decoder attends to encoder |
| Luong et al. (2015) | Stacked LSTM + attention | Global vs. local attention, dot-product scoring |
| Transformer (2017) | Pure attention | Self-attention + cross-attention, multi-head |
The paper also established a methodology that became standard: visualizing attention weights as interpretability evidence. The alignment heatmaps showed that the model was doing something sensible — attending to the right words — which built confidence in neural approaches to translation at a time when they were still regarded with skepticism.
Additive vs. dot-product attention. Bahdanau used additive attention: eij = vaT tanh(Wa si−1 + Ua hj). This involves a small neural network with learnable parameters. Luong et al. (2015) later showed that a simpler dot-product score between a decoder state and an encoder state works comparably and is faster because it can be implemented as a single matrix multiply. The Transformer adopted scaled dot-product attention, but the principle (score, normalize, aggregate) is identical to what Bahdanau introduced.
| Scoring Function | Formula | Parameters | Speed |
|---|---|---|---|
| Additive (Bahdanau) | vT tanh(W1q + W2k) | W1, W2, v | Slower (FFN) |
| Dot-product (Luong) | qTk | None | Fast (matmul) |
| Scaled dot-product (Vaswani) | qTk / √dk | None | Fast (matmul) |
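The three scoring functions in the table can be put side by side in a few lines. This is an illustrative sketch: q and k stand for the query and key vectors, and the weights W1, W2, and v are hypothetical values, not trained parameters.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def additive(q, k, W1, W2, v):
    """v^T tanh(W1 q + W2 k): a one-hidden-layer network scores the pair."""
    hidden = [math.tanh(dot(r1, q) + dot(r2, k)) for r1, r2 in zip(W1, W2)]
    return dot(v, hidden)

def dot_product(q, k):
    """q^T k: no parameters, but q and k must have the same dimension."""
    return dot(q, k)

def scaled_dot_product(q, k):
    """q^T k / sqrt(d_k): the same score, rescaled by the key dimension."""
    return dot(q, k) / math.sqrt(len(k))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, 0.0, -1.0, 2.0]
W1 = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]]
W2 = [[0.0, 0.0, 0.1, 0.0], [0.0, 0.0, 0.0, 0.1]]
v = [1.0, 1.0]

print(additive(q, k, W1, W2, v), dot_product(q, k), scaled_dot_product(q, k))
```

The additive score stays bounded by the tanh, while the raw dot product grows with the vector dimension; dividing by √dk is the Transformer's way of keeping that growth in check.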
This paper sits at a critical junction in the history of deep learning. Let's trace the threads forward and backward.
The paper also connects laterally to work on visual attention. Xu et al. (2015) applied essentially the same mechanism to image captioning — the decoder attends to different spatial regions of a CNN feature map when generating each word. The "show, attend and tell" model was directly inspired by this paper. In speech recognition, Chorowski et al. (2015) adapted Bahdanau attention for audio-to-text, attending to different time frames of a spectrogram.
Even outside sequence-to-sequence tasks, the attention paradigm spread rapidly. Memory networks (Sukhbaatar et al., 2015) used attention to read from external memory. Pointer networks (Vinyals et al., 2015) used attention to point at input positions. Graph attention networks (Velickovic et al., 2018) used attention to weight messages from neighbors. All trace their lineage back to this paper.
In a broader sense, what Bahdanau introduced is a form of content-based addressing — the decoder does not access source information by position (first word, last word) but by content (which source position is most relevant right now?). This content-based addressing is the same principle that powers modern retrieval-augmented generation (RAG), cross-attention in diffusion models, and the key-value caching that makes LLM inference efficient. The idea has proven to be one of the most portable innovations in all of deep learning.
It is worth noting the timing. This paper was submitted to arXiv in September 2014, about three months after Cho et al.'s encoder-decoder paper (June 2014) and in the same month as Sutskever et al.'s sequence-to-sequence paper. Within months of the encoder-decoder being proposed, its fundamental limitation was identified and solved. The speed of this progression (problem, solution, and generalization to the Transformer in 2017, all in about three years) makes this one of the fastest conceptual evolutions in the history of machine learning.
Paper details. "Neural Machine Translation by Jointly Learning to Align and Translate," Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio. ICLR 2015. arXiv:1409.0473. First submitted September 2014. Over 40,000 citations.