When fixed filters aren't enough — learn to dynamically focus on what matters, anywhere in the sequence.
Consider this sentence: "The cat sat on the mat because it was tired." What does "it" refer to? Obviously the cat, not the mat. You knew this instantly. But how?
Your brain didn't process each word with a fixed-size filter sliding across the sentence (that's what a CNN would do). Instead, you looked back at relevant words — "cat" is far from "it" in the sequence, yet you connected them effortlessly. This ability to dynamically focus on relevant parts of the input, regardless of distance, is attention.
CNNs: Each layer has a fixed receptive field (e.g., 3 tokens). Each stride-1, 3-wide convolution widens the receptive field by only 2 tokens, so connecting tokens 20 positions apart takes about 10 stacked layers. Information must propagate through all of those layers, which is slow and lossy.
RNNs: Process sequentially, maintaining a hidden state. Token 1 must survive through 19 sequential compression steps to influence token 20. In practice, long-range dependencies get forgotten (even with LSTMs).
Attention: Every token can directly attend to every other token in one operation. No distance penalty. The model learns which connections matter.
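The layer count in the CNN case comes from simple receptive-field arithmetic. Here is a minimal sketch, assuming stride-1, undilated convolutions with no pooling; the function names are illustrative and not part of any library.

```python
import math

def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Tokens visible to one output after stacking stride-1, undilated convolutions."""
    return num_layers * (kernel_size - 1) + 1

def layers_to_connect(distance: int, kernel_size: int = 3) -> int:
    """Fewest layers so that one output sees two tokens `distance` positions apart."""
    # The receptive field has to cover distance + 1 tokens.
    return math.ceil(distance / (kernel_size - 1))

print(receptive_field(10))     # 21
print(layers_to_connect(20))   # 10
```

Dilation or pooling would reduce the depth needed, but the information still has to pass through every intermediate layer, whereas attention connects the two tokens in a single step.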
| Domain | Example | What attention does |
|---|---|---|
| Text | "it" → "cat" | Resolve pronoun references across long distances |
| Vision | Dog vs. cat image | Focus on the animal, ignore the background |
| Audio-visual | Speaker in a crowd | Attend to the face that matches the voice |
Click on different words to see what a trained attention model focuses on (highlighted). Notice how "it" attends strongly to "cat" despite being 6 tokens away.
The attention mechanism uses three concepts borrowed from database retrieval: queries, keys, and values.
Analogy: You're at a library. You have a question (query). Each book has a title and summary on the spine (key). When your question matches a book's description, you pull it off the shelf and read the content (value). The better the match between your query and the key, the more you rely on that book's value.
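Given an input sequence $X = [x_1, \dots, x_n]^T$ ($n$ tokens, each a $d$-dimensional vector), we compute three matrices:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ are learned projection matrices.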
In self-attention, queries, keys, and values all come from the same input X. Each token queries every other token (including itself). This is how "it" can query the whole sentence and find "cat" as the best match.
Sequence of 3 tokens, $d = 4$, $d_k = d_v = 2$. Token embeddings:
Suppose $W_Q$ projects to queries that capture "what am I looking for?" and $W_K$ captures "what do I contain?". After projection (simplified):
Score matrix $QK^T$:
Token 3's query $[1, 1]$ matches key $[1, 1]$ (token 3 itself) with score 2, and matches the others with score 1. After softmax, token 3 attends most strongly to itself.
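The projection and scoring steps can be sketched in a few lines of plain Python. The embeddings `X` and the projections `W_Q` and `W_K` below are hypothetical values, chosen only so that token 3 ends up with query and key $[1, 1]$ as described above; they are not the tutorial's own numbers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Hypothetical embeddings (n = 3 tokens, d = 4); not taken from the text.
X = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [1, 1, 0, 0]]

# Hypothetical projections (d = 4 -> d_k = 2) that keep the first two features.
W_Q = [[1, 0], [0, 1], [0, 0], [0, 0]]
W_K = [[1, 0], [0, 1], [0, 0], [0, 0]]

Q = matmul(X, W_Q)                # "what am I looking for?"
K = matmul(X, W_K)                # "what do I contain?"
scores = matmul(Q, transpose(K))  # scores[i][j] = q_i . k_j

print(Q)       # [[1, 0], [0, 1], [1, 1]]
print(scores)  # [[1, 0, 1], [0, 1, 1], [1, 1, 2]]
```

The last row of `scores` shows token 3 scoring 2 against its own key and 1 against the other two.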
Three tokens with their query (arrow) and key (circle) vectors in 2D. The dot product between query and key determines attention strength. Drag tokens to see how scores change.
Now we assemble the full attention formula. The output for each query is a weighted sum of values, where weights are determined by query-key similarity:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Let's break this apart piece by piece.
Scores: $S = QK^T$. Entry $S_{ij} = q_i^T k_j$ measures how compatible query $i$ is with key $j$. Higher dot product = better match = more attention.
Scaling: $\tilde{S} = S / \sqrt{d_k}$. Why scale? If the entries of $q$ and $k$ are independent with zero mean and unit variance, $q^T k$ has variance $d_k$, so raw scores grow in magnitude as $d_k$ grows. Large scores push softmax into saturation, where gradients are tiny. Dividing by $\sqrt{d_k}$ brings the variance back to 1 and keeps softmax in its useful range.
Softmax: Each row of $A$ is a probability distribution: $A_{ij} = \exp(\tilde{S}_{ij}) / \sum_{j'} \exp(\tilde{S}_{ij'})$. Row $i$ tells us how token $i$ distributes its attention across all tokens.
Weighted sum: Output token $i$ is $O_i = \sum_j A_{ij} V_j$, where $V_j$ is the value vector of token $j$. This is a convex combination of all value vectors, weighted by attention.
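The variance argument behind the scaling step can be checked empirically. The sketch below assumes queries and keys with independent standard normal entries, which is an idealization of what a trained model produces; the function name and trial count are illustrative.

```python
import random
import statistics

def dot_product_variance(d_k: int, scaled: bool, trials: int = 2000, seed: int = 0) -> float:
    """Empirical variance of q.k for q, k with i.i.d. standard normal entries."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        samples.append(s / d_k ** 0.5 if scaled else s)
    return statistics.pvariance(samples)

for d_k in (4, 16, 64):
    raw = dot_product_variance(d_k, scaled=False)
    fixed = dot_product_variance(d_k, scaled=True)
    print(f"d_k={d_k:3d}  var(q.k)={raw:7.2f}  var(q.k/sqrt(d_k))={fixed:5.2f}")
```

The unscaled variance should track $d_k$, while the scaled variance should stay close to 1 for every $d_k$.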
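Three tokens, $d_k = d_v = 2$.
Scores: $QK^T = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 2 & 1 & 1 \end{bmatrix}$
Scale: $QK^T / \sqrt{2} = \begin{bmatrix} 0.71 & 0 & 0.71 \\ 0.71 & 0.71 & 0 \\ 1.41 & 0.71 & 0.71 \end{bmatrix}$
Softmax (row-wise): $A = \begin{bmatrix} 0.40 & 0.20 & 0.40 \\ 0.40 & 0.40 & 0.20 \\ 0.50 & 0.25 & 0.25 \end{bmatrix}$
Output: With value rows $[1, 2]$, $[3, 4]$, $[5, 6]$, $O_1 = 0.40\,[1, 2] + 0.20\,[3, 4] + 0.40\,[5, 6] = [3.00, 4.00]$. Token 1 attends equally to tokens 1 and 3 (both had scaled score 0.71), drawing equally from their values.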
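The worked example can be reproduced with a short, dependency-free sketch of scaled dot-product attention. `Q` and `K` below are assumed values, chosen because they reproduce the score matrix given above; `V` is read off the output line.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns (scores, weights, output)."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return scores, weights, matmul(weights, V)

# Q and K are assumptions that reproduce the score matrix in the text.
Q = [[1, 0], [0, 1], [1, 1]]
K = [[1, 1], [0, 1], [1, 0]]
V = [[1, 2], [3, 4], [5, 6]]

scores, weights, output = attention(Q, K, V)
print(scores)                             # [[1, 0, 1], [1, 1, 0], [2, 1, 1]]
print([round(w, 2) for w in weights[0]])  # [0.4, 0.2, 0.4]
print([round(o, 2) for o in output[0]])   # [3.0, 4.0]
```

Because token 1 weights tokens 1 and 3 equally, its output lands exactly on the middle value vector $[3, 4]$.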
See the full attention computation step by step. Adjust the scaling factor to see why √dk matters: without it, softmax saturates.
A single attention head computes one set of attention weights. But language (and signals) have multiple types of relationships simultaneously. In "The cat sat on the mat because it was tired":
| Relationship type | Connection |
|---|---|
| Coreference | "it" → "cat" |
| Spatial | "sat" → "mat" |
| Causal | "tired" → "because" |
| Subject-verb | "cat" → "sat" |
One attention pattern can't capture all these simultaneously. Solution: run multiple attention heads in parallel, each with its own $W_Q$, $W_K$, $W_V$ projections.
For $H$ heads, each head $h$ uses its own projections:

$$\text{head}_h = \text{Attention}(XW_Q^h,\; XW_K^h,\; XW_V^h)$$

Each projection maps to dimension $d_k = d_v = d/H$ (so the total computation is roughly the same as a single $d$-dimensional head).
Concatenate all heads and project back:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_H)\, W_O$$

where $W_O \in \mathbb{R}^{d \times d}$ mixes information across heads.
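A minimal sketch of the multi-head computation, in plain Python so that the concatenation step stays visible. The weights are random stand-ins for learned parameters, and the sizes (`n = 5`, `d = 8`, `H = 2`) are arbitrary choices for illustration.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    """`heads` is a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Concatenate head outputs along the feature axis: n x (H * d_v).
    concat = [sum((head[i] for head in outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n, d, H = 5, 8, 2
d_head = d // H  # d_k = d_v = d / H
X = random_matrix(n, d, rng)
heads = [tuple(random_matrix(d, d_head, rng) for _ in range(3)) for _ in range(H)]
W_O = random_matrix(d, d, rng)

Y = multi_head_attention(X, heads, W_O)
print(len(Y), len(Y[0]))  # 5 8
```

Each head works in a $d/H$-dimensional subspace, so the concatenated result has width $d$ again before the output projection mixes the heads.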
For model dimension d = 512 with H = 8 heads:
| Component | Shape | Parameters |
|---|---|---|
| $W_Q^h$ (per head) | 512 × 64 | 32,768 |
| $W_K^h$ (per head) | 512 × 64 | 32,768 |
| $W_V^h$ (per head) | 512 × 64 | 32,768 |
| All 8 heads (Q + K + V) | 8 × 3 × (512 × 64) | 786,432 |
| $W_O$ | 512 × 512 | 262,144 |
| Total | | 1,048,576 (~1M) |
Four heads attend to the same sentence. Each head learns a different attention pattern. Click a head to highlight its connections.
The Transformer (Vaswani et al., 2017) is not just attention — it's a specific architecture that wraps multi-head attention in a block with feed-forward networks, residual connections, and layer normalization. This block is then stacked L times.
Multi-head attention: Lets tokens exchange information. Without this, each token only knows about itself.
Feed-forward network (FFN): Processes each token independently (same FFN applied to each position). This is where per-token computation happens — think of it as "thinking about what I've gathered from attention." The expansion to 4d and back allows the network to compute complex functions of the attended information.
Residual connections: Same purpose as in ResNets — gradient flow and easy identity learning. Without them, stacking 96 layers (GPT-3) would be impossible to train.
Layer normalization: Like BatchNorm but computed per-token (across the d-dimensional feature, not across the batch). Stabilizes training without depending on batch size.
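The FFN computes $\text{FFN}(x) = W_2\,\text{ReLU}(W_1 x + b_1) + b_2$, where $W_1 \in \mathbb{R}^{4d \times d}$ and $W_2 \in \mathbb{R}^{d \times 4d}$.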
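The way these four components fit together can be sketched as follows. This is a simplified illustration rather than a reference implementation: it assumes the pre-norm arrangement `X + Sublayer(LN(X))`, omits the learned gain and bias of layer normalization, and uses a placeholder `mean_attention` function in place of real multi-head attention. All weights are made-up toy values.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector across its d features (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 ReLU(W1 x + b1) + b2 for one token vector x."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def transformer_block(X, attend, W1, b1, W2, b2):
    """Pre-norm block: X + Attn(LN(X)), then X + FFN(LN(X)) per token."""
    mixed = attend([layer_norm(x) for x in X])      # tokens exchange information
    X = [[a + b for a, b in zip(x, m)] for x, m in zip(X, mixed)]  # residual 1
    out = []
    for x in X:
        y = ffn(layer_norm(x), W1, b1, W2, b2)      # per-token computation
        out.append([a + b for a, b in zip(x, y)])   # residual 2
    return out

def mean_attention(rows):
    """Placeholder for attention: every token receives the average of all tokens."""
    avg = [sum(col) / len(rows) for col in zip(*rows)]
    return [avg[:] for _ in rows]

d = 2
W1 = [[0.1 * (i + j) for j in range(d)] for i in range(4 * d)]   # shape 4d x d
b1 = [0.0] * (4 * d)
W2 = [[0.05 * (i - j) for j in range(4 * d)] for i in range(d)]  # shape d x 4d
b2 = [0.0] * d

X = [[1.0, 2.0], [3.0, 5.0], [0.0, -1.0]]
Y = transformer_block(X, mean_attention, W1, b1, W2, b2)
print(len(Y), len(Y[0]))  # 3 2
```

The input and output shapes match, which is what allows the block to be stacked repeatedly.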
| Component | Parameters |
|---|---|
| Multi-head attention (Q, K, V, O) | ~1.05M |
| FFN (W1, b1, W2, b2) | 512×2048 + 2048 + 2048×512 + 512 = ~2.1M |
| Layer norms (2 ×) | 2 × 2 × 512 = 2,048 |
| Total per block | ~3.15M |
| 6-layer Transformer | ~19M |
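The counts in the table follow from the layer shapes alone. A small sketch of the arithmetic, assuming attention projections without biases and two layer norms per block, as in the table (the function names are illustrative):

```python
def attention_params(d: int) -> int:
    """W_Q, W_K, W_V (all heads together) plus W_O, without biases."""
    return 4 * d * d

def ffn_params(d: int, expansion: int = 4) -> int:
    """W1, b1, W2, b2 with hidden width expansion * d."""
    hidden = expansion * d
    return d * hidden + hidden + hidden * d + d

def layer_norm_params(d: int, count: int = 2) -> int:
    """Each LayerNorm has a gain vector and a bias vector of size d."""
    return count * 2 * d

def block_params(d: int) -> int:
    return attention_params(d) + ffn_params(d) + layer_norm_params(d)

d = 512
print(attention_params(d))   # 1048576
print(ffn_params(d))         # 2099712
print(layer_norm_params(d))  # 2048
print(block_params(d))       # 3150336
print(6 * block_params(d))   # 18902016
```

The attention count does not depend on the number of heads, because $H$ heads of width $d/H$ together use the same $3d^2$ projection weights as one head of width $d$. These totals exclude token and position embeddings.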
Click each component to see data flow through a Transformer block. Watch how the residual connections preserve information while attention and FFN add to it.
The Transformer was invented for machine translation (2017), but its architecture is fundamentally about processing sequences, and signals are sequences. Within 3 years, Transformer-based models reached state-of-the-art results on many speech, audio, and vision tasks.
Traditional pipeline: audio → STFT → mel filterbank → CNN/RNN features → decoder. The Transformer replaces the entire CNN/RNN stack:
Input: sequence of mel-spectrogram frames (each frame is a "token"). Self-attention lets frame 50 attend directly to frame 5 — capturing long-range dependencies like repeated words or prosodic patterns across an entire utterance.
Whisper (OpenAI, 2022): Transformer encoder-decoder trained on 680K hours of audio. The encoder uses self-attention over spectrogram frames. The decoder uses cross-attention to attend to encoder outputs while generating text token by token.
Separating vocals from instruments requires understanding long-range structure (verse patterns, repetitions). Attention captures this: the vocal pattern at measure 5 should be similar to measure 13 (same melody). A CNN would need very deep networks to see across 8 measures.
Split an image into 16×16 patches. Flatten each patch into a vector. Treat patches as "tokens" and apply a standard Transformer. Self-attention lets distant patches communicate directly — a patch in the top-left can attend to a patch in the bottom-right without any pooling hierarchy.
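The patch-splitting step can be sketched in plain Python. This assumes a single-channel image stored as a list of rows; the 224 × 224 size and the `patchify` name are illustrative choices, not details from the text.

```python
def patchify(image, patch=16):
    """Split an H x W single-channel image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

# Hypothetical 224 x 224 grayscale image whose pixel value encodes its position.
image = [[r * 224 + c for c in range(224)] for r in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 196 256
```

In a full model, each flattened patch would then pass through a learned linear projection and receive a position embedding before entering the Transformer; both are omitted here.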
For sensor signals, financial data, or EEG: each time step is a token. Attention learns which past time steps are most predictive of the future. Unlike an AR model (fixed window), attention can learn to attend to periodic events (e.g., heartbeats) regardless of variable spacing.
A periodic signal with anomalies. The attention weights show which past time steps are most relevant for predicting the highlighted position. Notice it attends to the same phase of previous cycles.
Two frontier applications that build on attention: (1) generating signals from text descriptions, and (2) making large models smaller without losing quality.
Diffusion models generate data by learning to reverse a noise-adding process. Starting from pure noise, the model predicts and removes noise step by step until a clean signal (image, audio) emerges.
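How does text conditioning work? Cross-attention. The diffusion model's intermediate features provide the queries; the text embedding (from a separate text encoder) provides keys and values: $Q = ZW_Q$, $K = CW_K$, $V = CW_V$, where $Z$ holds the intermediate features and $C$ holds the text-token embeddings.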
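Each spatial position in the generated image attends to relevant words in the text prompt. For "A red car on a beach", car pixels attend to "red car" and sky pixels attend to "beach." Stable Diffusion conditions its denoiser on text this way, and other text-to-image systems such as DALL-E use closely related conditioning. MusicGen is an autoregressive Transformer rather than a diffusion model, but it uses the same cross-attention idea to condition on text.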
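Cross-attention differs from self-attention only in where the queries, keys, and values come from. A minimal sketch with hypothetical toy data follows: four signal positions act as queries and three text tokens supply keys and values. All numbers and weight matrices are made up for illustration.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(features, text, W_Q, W_K, W_V):
    """Queries from the generated signal, keys and values from the text."""
    Q = matmul(features, W_Q)
    K = matmul(text, W_K)
    V = matmul(text, W_V)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_T)]
    return weights, matmul(weights, V)

# Hypothetical toy data: 4 signal positions (dim 2) and 3 text tokens (dim 3).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_Q = [[2.0, 0.0], [0.0, 2.0]]               # signal dim 2 -> d_k = 2
W_K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # text dim 3 -> d_k = 2
W_V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # text dim 3 -> d_v = 2

weights, out = cross_attention(features, text, W_Q, W_K, W_V)
print(len(weights), len(weights[0]))  # 4 3
print(len(out), len(out[0]))          # 4 2
```

The attention matrix is rectangular: one row per signal position and one column per text token, so each position distributes its attention over the prompt.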
Large Transformers are expensive to deploy (GPT-3 has 175B parameters; GPT-4 is reported, but not officially confirmed, to have roughly 1.8T). Three compression strategies leverage the structure of attention:
Pruning attention heads. Not all heads are useful after training. Many can be removed with minimal quality loss. Some heads are redundant; others attend to trivial patterns (always attend to the previous token).
Knowledge distillation. Train a small "student" Transformer to match the attention patterns and outputs of a large "teacher." The student learns which attention patterns matter most.
Low-rank factorization. The attention matrix $A = \text{softmax}(QK^T / \sqrt{d_k})$ is $n \times n$ but often has effective rank much lower than $n$. Low-rank approximations reduce computation from $O(n^2)$ to $O(n \cdot r)$ where $r \ll n$.
| Method | Compression | Trade-off |
|---|---|---|
| Head pruning | Remove 30-50% of heads | ~1% accuracy loss |
| Distillation | 6× smaller model | ~3% accuracy loss |
| Low-rank attention | $O(n)$ instead of $O(n^2)$ | Approximation error on long contexts |
| Quantization | 16-bit → 4-bit weights | Minimal loss with careful calibration |
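To make the quantization row concrete, here is a deliberately simple sketch of symmetric, per-tensor 4-bit quantization. It is an assumption-laden toy: practical 4-bit methods typically use per-group scales and calibration data, and the weight values below are invented.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to integers in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 if all zero
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.70, -0.33, 0.10, -0.02, 0.00, 0.49]  # made-up example weights
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)

print(q)  # [7, -3, 1, 0, 0, 5]
print(max(abs(a - b) for a, b in zip(weights, restored)) <= scale / 2 + 1e-12)  # True
```

Each weight is stored as one of 16 integer levels plus a shared scale, and the rounding error per weight is at most half a quantization step.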
A text prompt conditions signal generation via cross-attention. See which text tokens each signal position attends to most strongly.
From the "it" problem to generating images from text — attention is the mechanism that unlocked it all. Let's consolidate.
| Component | Formula | Purpose |
|---|---|---|
| Query/Key/Value | $Q = XW_Q$, $K = XW_K$, $V = XW_V$ | Project into comparison space |
| Scaled dot-product | $\text{softmax}(QK^T / \sqrt{d_k})\, V$ | Content-based weighted sum |
| Multi-head | $\text{Concat}(\text{head}_1, \dots, \text{head}_H)\, W_O$ | Multiple relationship types |
| Residual + LayerNorm | $X + \text{Sublayer}(\text{LN}(X))$ | Gradient flow + stability |
| FFN | $W_2\,\text{ReLU}(W_1 x + b_1) + b_2$ | Per-token nonlinear processing |
| Cross-attention | Q from signal, K/V from text | Condition on another modality |
| EE Domain | Before Transformers | With Transformers |
|---|---|---|
| Speech recognition | HMM + CNN/RNN | Whisper (Transformer encoder-decoder) |
| Audio generation | WaveNet (dilated CNN) | MusicGen (Transformer + cross-attention) |
| Signal separation | NMF, deep clustering | SepFormer (dual-path attention) |
| Radar/sonar | CFAR + matched filter | Attention on range-Doppler maps |
| Channel estimation | LS, MMSE | Transformer on pilot symbols |
Self-attention is $O(n^2)$ in sequence length (every token attends to every other). For $n$ = 100K tokens, that's 10 billion attention scores per head, per layer. Active research: linear attention, sparse attention, sliding window attention, and state-space models (Mamba) aim to maintain the benefits while reducing the cost to $O(n)$ or $O(n \log n)$.
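The scale of the quadratic cost is easy to check with a few lines of arithmetic. The window size of 512 and the 2 bytes per score (16-bit storage) are illustrative assumptions.

```python
def attention_scores(n: int) -> int:
    """Number of query-key scores in full self-attention (per head, per layer)."""
    return n * n

def sliding_window_scores(n: int, window: int) -> int:
    """Upper bound when each token attends to at most `window` tokens."""
    return n * min(n, window)

n = 100_000
print(attention_scores(n))             # 10000000000
print(sliding_window_scores(n, 512))   # 51200000
print(attention_scores(n) * 2 / 1e9)   # 20.0 (GB at 2 bytes per score)
```

Restricting each token to a fixed window makes the cost grow linearly in $n$, at the price of losing direct connections beyond the window.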
"Attention is all you need." — Vaswani et al., 2017