EE269 Lecture 25 — Mert Pilanci, Stanford

Attention & Transformers

When fixed filters aren't enough — learn to dynamically focus on what matters, anywhere in the sequence.

Prerequisites: Neural networks (forward/backward pass) + Matrix multiplication. That's it.
8 chapters · 7+ simulations · 0 assumed knowledge

Chapter 0: Why Attention?

Consider this sentence: "The cat sat on the mat because it was tired." What does "it" refer to? Obviously the cat, not the mat. You knew this instantly. But how?

Your brain didn't process each word with a fixed-size filter sliding across the sentence (that's what a CNN would do). Instead, you looked back at relevant words — "cat" is far from "it" in the sequence, yet you connected them effortlessly. This ability to dynamically focus on relevant parts of the input, regardless of distance, is attention.

The Limitations of CNNs and RNNs

CNNs: Each layer has a fixed receptive field (e.g., 3 tokens). A 3-wide, stride-1 layer widens the receptive field by only 2 tokens, so connecting tokens 20 positions apart takes at least 10 stacked layers. Information must propagate through all those layers, which is slow and lossy.

RNNs: Process sequentially, maintaining a hidden state. Token 1 must survive through 19 sequential compression steps to influence token 20. In practice, long-range dependencies get forgotten (even with LSTMs).

Attention: Every token can directly attend to every other token in one operation. No distance penalty. The model learns which connections matter.
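
The distance argument can be made concrete with a little arithmetic. The sketch below assumes plain stride-1, undilated convolutions, whose receptive field after L layers of width k is 1 + L(k - 1); the function names are illustrative and do not come from the lecture.

```python
def conv_layers_needed(distance: int, kernel_size: int = 3) -> int:
    """Smallest number of stacked stride-1, undilated convolutions whose
    receptive field covers two tokens `distance` positions apart."""
    span = distance + 1            # tokens the receptive field must cover
    growth = kernel_size - 1       # each layer widens the field by k - 1
    field, layers = 1, 0           # a single token before any convolution
    while field < span:
        field += growth
        layers += 1
    return layers


def rnn_steps_needed(distance: int) -> int:
    """Sequential hidden-state updates between two tokens `distance` apart."""
    return distance


print(conv_layers_needed(20))      # 3-wide convolutions, tokens 20 apart
print(rnn_steps_needed(19))        # token 1 influencing token 20
```

Attention needs one layer in both cases, because every token can score every other token directly.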

Attention = dynamic, content-based routing. Unlike convolution (fixed pattern, shifted across positions) or fully-connected (fixed weights for fixed positions), attention computes connections based on the content of the input itself. The same layer can connect "it" to "cat" in one sentence and "it" to "house" in another. The routing changes with the input.

Three Types of Attention

Domain | Example | What attention does
Text | "it" → "cat" | Resolve pronoun references across long distances
Vision | Dog vs. cat image | Focus on the animal, ignore the background
Audio-visual | Speaker in a crowd | Attend to the face that matches the voice

The "it" Resolution Problem

Click on different words to see what a trained attention model focuses on (highlighted). Notice how "it" attends strongly to "cat" despite being 6 tokens away.

Chapter 1: Query-Key-Value

The attention mechanism uses three concepts borrowed from database retrieval: queries, keys, and values.

Analogy: You're at a library. You have a question (query). Each book has a title and summary on the spine (key). When your question matches a book's description, you pull it off the shelf and read the content (value). The better the match between your query and the key, the more you rely on that book's value.

Formal Definition

Given an input sequence X = [x_1, ..., x_n]^T (n tokens, each a d-dimensional vector), we compute three matrices:

Q = X W_Q    (n × d_k): queries
K = X W_K    (n × d_k): keys
V = X W_V    (n × d_v): values

where W_Q, W_K ∈ ℝ^{d×d_k} and W_V ∈ ℝ^{d×d_v} are learned projection matrices.
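
As a shape check, the three projections can be written in a few lines of NumPy. This is a minimal sketch: the sizes are arbitrary, and the weight matrices are random stand-ins for what would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, d_k, d_v = 5, 16, 8, 8            # illustrative sizes, not from the lecture

X = rng.standard_normal((n, d))          # n tokens, each d-dimensional
W_Q = rng.standard_normal((d, d_k))      # learned in practice; random here
W_K = rng.standard_normal((d, d_k))
W_V = rng.standard_normal((d, d_v))

Q = X @ W_Q                              # (n, d_k) queries
K = X @ W_K                              # (n, d_k) keys
V = X @ W_V                              # (n, d_v) values

print(Q.shape, K.shape, V.shape)
```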

Why three separate projections? The same token plays different roles: "cat" as a query asks "what did I do?"; as a key it advertises "I'm an animate noun"; as a value it contributes its semantic content to other tokens' representations. Separating Q, K, V lets the network learn these distinct roles independently.

Self-Attention: Queries Come From the Same Sequence

In self-attention, queries, keys, and values all come from the same input X. Each token queries every token in the sequence, including itself. This is how "it" can query the whole sentence and find "cat" as the best match.

Data Flow

Input X (n × d): n tokens, each d-dimensional
  ↓ × W_Q, W_K, W_V
Q (n × d_k), K (n × d_k), V (n × d_v): three projections of the same input
  ↓ Q K^T
Scores (n × n): how much each token attends to every other
  ↓ softmax, then × V
Output (n × d_v): weighted combination of values

Worked Example

Sequence of 3 tokens, d = 4, dk = dv = 2. Token embeddings:

X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]

Suppose W_Q projects to queries that capture "what am I looking for?" and W_K captures "what do I contain?". After projection (simplified):

Q = [[1, 0], [0, 1], [1, 1]]    K = [[0, 1], [1, 0], [1, 1]]

Score matrix Q K^T:

Q K^T = [[0, 1, 1], [1, 0, 1], [1, 1, 2]]

Token 3's query [1,1] matches key [1,1] (token 3 itself) with score 2, and matches the others with score 1. After softmax, token 3 attends most strongly to itself.
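
The score matrix in this example is a single matrix product, which is easy to verify. A minimal NumPy sketch using the Q and K given above (variable names are illustrative):

```python
import numpy as np

Q = np.array([[1, 0], [0, 1], [1, 1]])
K = np.array([[0, 1], [1, 0], [1, 1]])

scores = Q @ K.T                      # scores[i, j] = q_i . k_j
print(scores)

# Index of the highest-scoring key for each query (0-based).
# Rows 0 and 1 contain ties; argmax reports the first maximum.
best_match = scores.argmax(axis=1)
print(best_match)
```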

Query-Key Matching

Three tokens with their query (arrow) and key (circle) vectors in 2D. The dot product between query and key determines attention strength. Drag tokens to see how scores change.

Chapter 2: Scaled Dot-Product Attention

Now we assemble the full attention formula. The output for each query is a weighted sum of values, where weights are determined by query-key similarity:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

Let's break this apart piece by piece.

Step 1: Compute Scores

S = Q K^T    (n × n matrix)

Entry S_ij = q_i^T k_j measures how compatible query i is with key j. Higher dot product = better match = more attention.

Step 2: Scale

S̃ = S / √d_k

Why scale? Without it, for large d_k, the dot products grow in magnitude (the variance of q^T k is proportional to d_k if q and k have unit-variance entries). Large values push softmax into saturation, where gradients are tiny. Dividing by √d_k keeps the variance at 1, maintaining softmax in its useful range.

The scaling factor demystified. If each component of q and k is drawn independently with mean 0 and variance 1, then E[q^T k] = 0 and Var(q^T k) = d_k. After dividing by √d_k, the variance becomes 1. This keeps softmax inputs in the range where it produces non-degenerate probability distributions (not all mass on one token).
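
The variance claim can be checked empirically. The sketch below assumes exactly the setting described above (independent, zero-mean, unit-variance components); the sample size and the helper name `dot_product_variance` are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 20_000                     # illustrative sample size


def dot_product_variance(d_k: int, scaled: bool) -> float:
    """Empirical variance of q.k for random unit-variance q and k."""
    q = rng.standard_normal((num_samples, d_k))
    k = rng.standard_normal((num_samples, d_k))
    dots = np.sum(q * k, axis=1)         # one dot product per sample
    if scaled:
        dots = dots / np.sqrt(d_k)       # the 1/sqrt(d_k) factor from Step 2
    return float(dots.var())


for d_k in (4, 64, 256):
    raw = dot_product_variance(d_k, scaled=False)
    fixed = dot_product_variance(d_k, scaled=True)
    print(f"d_k={d_k:4d}  unscaled variance={raw:8.1f}  scaled variance={fixed:5.2f}")
```

The unscaled variance should track d_k, while the scaled variance should stay near 1 regardless of d_k.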

Step 3: Softmax

A = softmax(S̃)    (applied row-wise)

Each row of A is a probability distribution: A_ij = exp(S̃_ij) / ∑_j' exp(S̃_ij'). Row i tells us how token i distributes its attention across all tokens.
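
A row-wise softmax takes only a few lines. The sketch below subtracts each row's maximum before exponentiating, a common numerical-stability trick that the lecture does not mention; it leaves the result unchanged because softmax is invariant to adding a constant to a row.

```python
import numpy as np


def softmax_rows(scores: np.ndarray) -> np.ndarray:
    """Softmax applied independently to each row of `scores`."""
    shifted = scores - scores.max(axis=-1, keepdims=True)   # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


A = softmax_rows(np.array([[1.0, 2.0, 3.0],
                           [1000.0, 1000.0, 1000.0]]))
print(A)
print(A.sum(axis=1))      # each row sums to 1
```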

Step 4: Weighted Sum of Values

Output = A V    (n × d_v)

Output token i is ∑_j A_ij v_j, a convex combination of all value vectors, weighted by attention.

Complete Numerical Example

Three tokens, d_k = 2, d_v = 2.

Q = [[1, 0], [0, 1], [1, 1]]    K = [[1, 1], [0, 1], [1, 0]]    V = [[1, 2], [3, 4], [5, 6]]

Scores: Q K^T = [[1, 0, 1], [1, 1, 0], [2, 1, 1]]

Scale: ÷ √2 = [[0.71, 0, 0.71], [0.71, 0.71, 0], [1.41, 0.71, 0.71]]

Softmax (row-wise):

A_1 = [0.40, 0.20, 0.40]
A_2 = [0.40, 0.40, 0.20]
A_3 = [0.50, 0.25, 0.25]

Output: O_1 = 0.40·[1, 2] + 0.20·[3, 4] + 0.40·[5, 6] = [3.00, 4.00]. Token 1 attends equally to tokens 1 and 3 (both had scaled score 0.71), drawing from both their values.
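
The four steps can be run end to end on the Q, K, and V of this example. This is a minimal NumPy sketch: the function name `scaled_dot_product_attention` and the max-subtraction inside the softmax are choices made here, not part of the lecture.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # Steps 1 and 2
    scores = scores - scores.max(axis=-1, keepdims=True)   # stability only
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # Step 3
    return weights @ V, weights                            # Step 4


Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

output, A = scaled_dot_product_attention(Q, K, V)
print(np.round(A, 2))        # attention weights, one row per query
print(np.round(output, 2))   # one output vector per token
```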

Attention Score Computation

See the full attention computation step by step. Adjust the scaling factor to see why √d_k matters: without it, softmax saturates.


Chapter 3: Multi-Head Attention

A single attention head computes one set of attention weights. But language (and signals) have multiple types of relationships simultaneously. In "The cat sat on the mat because it was tired":

Relationship type | Connection
Coreference | "it" → "cat"
Spatial | "sat" → "mat"
Causal | "tired" → "because"
Subject-verb | "cat" → "sat"

One attention pattern can't capture all these simultaneously. Solution: run multiple attention heads in parallel, each with its own W_Q, W_K, W_V projections.

The Multi-Head Formula

For H heads, each head h uses its own projections:

head_h = Attention(X W_Q^h, X W_K^h, X W_V^h)

Each projection maps to dimension d_k = d_v = d/H, so the total computation is about the same as for a single d-dimensional head.

Concatenate all heads along the feature dimension and project back:

MultiHead(X) = Concat(head_1, head_2, ..., head_H) W_O

where W_O ∈ ℝ^{d×d} mixes information across heads.
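
The multi-head formula translates almost directly into NumPy. The sketch below loops over heads for readability (real implementations usually batch the heads into one tensor operation); the sizes and random weights are illustrative assumptions.

```python
import numpy as np


def softmax_rows(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n, d). W_Q, W_K, W_V: (H, d, d/H), one slice per head. W_O: (d, d)."""
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):          # one iteration per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v           # each (n, d/H)
        A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(A @ V)                           # (n, d/H)
    concat = np.concatenate(heads, axis=-1)           # (n, d)
    return concat @ W_O                               # mix information across heads


rng = np.random.default_rng(0)
n, d, H = 6, 32, 4                                    # illustrative sizes
d_head = d // H
X = rng.standard_normal((n, d))
W_Q, W_K, W_V = (rng.standard_normal((H, d, d_head)) / np.sqrt(d) for _ in range(3))
W_O = rng.standard_normal((d, d)) / np.sqrt(d)

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)                                      # same shape as X
```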

Parameter Count

For model dimension d = 512 with H = 8 heads:

Component | Shape | Parameters
W_Q^h (per head) | 512 × 64 | 32,768
W_K^h (per head) | 512 × 64 | 32,768
W_V^h (per head) | 512 × 64 | 32,768
All 8 heads (Q+K+V) | 8 × 3 × 32,768 | 786,432
W_O | 512 × 512 | 262,144
Total | | 1,048,576 (~1.05M parameters)
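
The parameter count above is plain arithmetic and can be reproduced in a few lines. As in the table, biases are not counted; the variable names are illustrative.

```python
d, H = 512, 8
d_head = d // H                             # per-head dimension d/H

per_projection = d * d_head                 # one W_Q^h, W_K^h, or W_V^h
qkv_all_heads = H * 3 * per_projection      # Q, K, V projections for every head
output_projection = d * d                   # W_O
total = qkv_all_heads + output_projection

print(d_head, per_projection, qkv_all_heads, output_projection, total)
```
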

Heads specialize. After training, different heads learn to attend to different linguistic/signal properties. In language models, researchers have found heads that track: subject-verb agreement, positional proximity, rare word detection, and syntactic arcs. Each head "looks at the data differently" through its learned projections.

Multi-Head Attention Visualization

Four heads attend to the same sentence. Each head learns a different attention pattern. Click a head to highlight its connections.


Chapter 4: The Transformer Block

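The Transformer (Vaswani et al., 2017) is not just attention: it is a specific architecture that wraps multi-head attention in a block with feed-forward networks, residual connections, and layer normalization. This block is then stacked L times. The block described here is the pre-norm variant (layer norm before each sublayer); the original paper applied layer norm after each residual addition.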

One Transformer Block

Input X (n × d): token embeddings + positional encoding
  ↓
Layer Norm: normalize each token independently
  ↓
Multi-Head Self-Attention: each token attends to all others
  ↓ + X (residual)
Layer Norm: normalize again
  ↓
Feed-Forward Network (FFN): two linear layers with ReLU, d → 4d → d
  ↓ + X' (residual)
Output (n × d): same shape as input, so blocks can stack

Why Each Component Exists

Multi-head attention: Lets tokens exchange information. Without this, each token only knows about itself.

Feed-forward network (FFN): Processes each token independently (same FFN applied to each position). This is where per-token computation happens — think of it as "thinking about what I've gathered from attention." The expansion to 4d and back allows the network to compute complex functions of the attended information.

Residual connections: Same purpose as in ResNets: gradient flow and easy identity learning. Without them, a stack of 96 layers (GPT-3) would be extremely hard to train.

Layer normalization: Like BatchNorm but computed per-token (across the d-dimensional feature, not across the batch). Stabilizes training without depending on batch size.

The Transformer's superpower is composability. Because input and output have the same shape (n × d), blocks stack trivially. Each additional block lets tokens gather information from tokens that have already gathered information, enabling multi-hop reasoning. GPT-4 is rumored to use ~120 blocks (OpenAI has not published the architecture); each one refines the representation.

Complete Equations for One Block

X' = X + MultiHead(LayerNorm(X))
Output = X' + FFN(LayerNorm(X'))
FFN(z) = W_2 · ReLU(W_1 z + b_1) + b_2

where W_1 ∈ ℝ^{4d×d} and W_2 ∈ ℝ^{d×4d}.
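
The block equations above can be sketched as a forward pass in NumPy. To stay short, this sketch makes several simplifications that are assumptions of the example, not of the lecture: a single attention head stands in for MultiHead, layer norm has no learned gain or bias, and all weights are random.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token (row) across its d features.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax_rows(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(x, W_Q, W_K, W_V, W_O):
    # Single head standing in for MultiHead(.) to keep the sketch short.
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (A @ V) @ W_O


def ffn(x, W1, b1, W2, b2):
    # Tokens are rows here, so W_1 z + b_1 becomes x @ W1.T + b1.
    hidden = np.maximum(0.0, x @ W1.T + b1)       # ReLU, shape (n, 4d)
    return hidden @ W2.T + b2                     # back to (n, d)


def transformer_block(X, p):
    X1 = X + self_attention(layer_norm(X), p["W_Q"], p["W_K"], p["W_V"], p["W_O"])
    return X1 + ffn(layer_norm(X1), p["W1"], p["b1"], p["W2"], p["b2"])


rng = np.random.default_rng(0)
n, d = 6, 16                                      # illustrative sizes
scale = 1.0 / np.sqrt(d)
params = {
    "W_Q": rng.standard_normal((d, d)) * scale,
    "W_K": rng.standard_normal((d, d)) * scale,
    "W_V": rng.standard_normal((d, d)) * scale,
    "W_O": rng.standard_normal((d, d)) * scale,
    "W1": rng.standard_normal((4 * d, d)) * scale,
    "b1": np.zeros(4 * d),
    "W2": rng.standard_normal((d, 4 * d)) * scale,
    "b2": np.zeros(d),
}

X = rng.standard_normal((n, d))
out = transformer_block(X, params)
# Reusing the same weights here only shows that shapes compose;
# a real stack gives every block its own parameters.
out2 = transformer_block(out, params)
print(out.shape, out2.shape)
```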

Parameter Count (One Block, d=512, H=8)

Component | Parameters
Multi-head attention (Q, K, V, O) | ~1.05M
FFN (W_1, b_1, W_2, b_2) | 512×2048 + 2048 + 2048×512 + 512 ≈ 2.1M
Layer norms (2×) | 2 × 2 × 512 = 2,048
Total per block | ~3.15M
6-layer Transformer | ~19M
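
The per-block totals can be reproduced with a few lines of arithmetic. The sketch follows the same accounting as the table (no attention biases, embeddings not included); the variable names are illustrative.

```python
d, H = 512, 8
d_ff = 4 * d                                   # FFN hidden width, 4d

attention = 3 * d * d + d * d                  # W_Q, W_K, W_V over all heads, plus W_O
ffn = d * d_ff + d_ff + d_ff * d + d           # W_1, b_1, W_2, b_2
layer_norms = 2 * 2 * d                        # gain and bias for each of the two norms
per_block = attention + ffn + layer_norms

print(attention, ffn, layer_norms)
print(per_block, 6 * per_block)                # one block, then a 6-block stack
```
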

Interactive Transformer Block

Click each component to see data flow through a Transformer block. Watch how the residual connections preserve information while attention and FFN add to it.


Chapter 5: Signal Processing Applications

The Transformer was invented for machine translation (2017), but its architecture is fundamentally about processing sequences, and signals are sequences. Within a few years, Transformers reached state-of-the-art results on many speech, audio, and vision tasks.

Audio: Speech Recognition & Enhancement

Traditional pipeline: audio → STFT → mel filterbank → CNN/RNN features → decoder. The Transformer replaces the entire CNN/RNN stack:

Input: sequence of mel-spectrogram frames (each frame is a "token"). Self-attention lets frame 50 attend directly to frame 5 — capturing long-range dependencies like repeated words or prosodic patterns across an entire utterance.

Whisper (OpenAI, 2022): Transformer encoder-decoder trained on 680K hours of audio. The encoder uses self-attention over spectrogram frames. The decoder uses cross-attention to attend to encoder outputs while generating text token by token.

Music: Source Separation

Separating vocals from instruments requires understanding long-range structure (verse patterns, repetitions). Attention captures this: the vocal pattern at measure 5 should be similar to measure 13 (same melody). A CNN would need very deep networks to see across 8 measures.

Images: Vision Transformers (ViT)

Split an image into patches of 16×16 pixels. Flatten each patch into a vector. Treat patches as "tokens" and apply a standard Transformer. Self-attention lets distant patches communicate directly: a patch in the top-left can attend to a patch in the bottom-right without any pooling hierarchy.
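
The patch tokenization step can be sketched with NumPy reshapes. The 224×224 RGB image size is an assumption chosen for illustration, and `patchify` is a hypothetical helper name; a full ViT would additionally apply a learned linear projection and positional embeddings to these tokens.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened patch-by-patch tokens."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be divisible by patch size"
    grid = image.reshape(H // patch, patch, W // patch, patch, C)
    grid = grid.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * C)      # (num_patches, patch*patch*C)


image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image)
print(tokens.shape)      # (number of patches, values per patch)
```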

Time Series Forecasting

For sensor signals, financial data, or EEG: each time step is a token. Attention learns which past time steps are most predictive of the future. Unlike an AR model (fixed window), attention can learn to attend to periodic events (e.g., heartbeats) regardless of variable spacing.

The common thread. In every domain, the same pattern emerges: (1) Tokenize the signal into a sequence. (2) Apply self-attention to let each token gather relevant context from the entire sequence. (3) Stack blocks to build hierarchical understanding. The same architecture handles text, audio, images, and sensor data with minimal domain-specific changes.

Attention on a Time Series

A periodic signal with anomalies. The attention weights show which past time steps are most relevant for predicting the highlighted position. Notice it attends to the same phase of previous cycles.


Chapter 6: Text-Conditioned Diffusion & Model Compression

Two frontier applications that build on attention: (1) generating signals from text descriptions, and (2) making large models smaller without losing quality.

Text-Conditioned Diffusion

Diffusion models generate data by learning to reverse a noise-adding process. Starting from pure noise, the model predicts and removes noise step by step until a clean signal (image, audio) emerges.

How does text conditioning work? Cross-attention. The diffusion model's intermediate features are the queries; the text embedding (from a separate text encoder) provides keys and values:

Q = Z W_Q   (from noisy signal features)
K = T W_K,   V = T W_V   (from text embedding T)
CrossAttention = softmax(Q K^T / √d_k) V
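
The only change from self-attention is where the keys and values come from, which a short NumPy sketch makes visible in the shapes. The sizes (64 signal positions, 7 text tokens) and the random weights are assumptions for illustration.

```python
import numpy as np


def cross_attention(Z, T, W_Q, W_K, W_V):
    """Z: (m, d_z) signal features. T: (p, d_t) text embeddings."""
    Q = Z @ W_Q                                     # (m, d_k), from the signal
    K = T @ W_K                                     # (p, d_k), from the text
    V = T @ W_V                                     # (p, d_v), from the text
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # (m, p): signal position vs. word
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V, A


rng = np.random.default_rng(0)
m, p = 64, 7                                        # e.g. an 8x8 feature map, 7 text tokens
d_z, d_t, d_k, d_v = 32, 48, 16, 32                 # illustrative dimensions
Z = rng.standard_normal((m, d_z))
T = rng.standard_normal((p, d_t))
W_Q = rng.standard_normal((d_z, d_k)) / np.sqrt(d_z)
W_K = rng.standard_normal((d_t, d_k)) / np.sqrt(d_t)
W_V = rng.standard_normal((d_t, d_v)) / np.sqrt(d_t)

out, A = cross_attention(Z, T, W_Q, W_K, W_V)
print(out.shape, A.shape)
```

Each row of A is a distribution over text tokens for one signal position, so the attention matrix is m × p rather than square.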

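Each spatial position in the generated image attends to relevant words in the text prompt. "A red car on a beach" → car pixels attend to "red car," sand pixels attend to "beach." Stable Diffusion conditions its denoiser on text this way, and DALL-E-style image generators rely on closely related attention-based text conditioning. MusicGen is an autoregressive Transformer rather than a diffusion model, but it conditions on text through the same cross-attention mechanism.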

Cross-attention = the bridge between modalities. Text and images live in different spaces. Cross-attention lets one modality's features (Q) query another modality's representations (K, V). The same mechanism works for audio-text (MusicGen), video-text (Sora), and any cross-modal generation.

Model Compression

Large Transformers are expensive to deploy (GPT-4 is rumored to have ~1.8T parameters, a figure OpenAI has not confirmed). Three compression strategies leverage the structure of attention:

Pruning attention heads. Not all heads are useful after training. Many can be removed with minimal quality loss. Some heads are redundant; others attend to trivial patterns (always attend to the previous token).

Knowledge distillation. Train a small "student" Transformer to match the attention patterns and outputs of a large "teacher." The student learns which attention patterns matter most.

Low-rank factorization. The attention matrix A = softmax(Q K^T / √d_k) is n×n but often has effective rank much lower than n. Low-rank approximations reduce computation from O(n^2) to O(n·r) where r ≪ n.
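
One way to realize the O(n·r) cost is to compress the n keys and values down to r rows before attending, in the spirit of Linformer-style projections. The lecture does not say which low-rank method it has in mind, so this is only one possible sketch; the projection E is random here, whereas such methods typically learn it.

```python
import numpy as np


def softmax_rows(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)


def low_rank_attention(Q, K, V, E):
    """E: (r, n) projection that compresses n keys/values to r rows."""
    K_r = E @ K                                          # (r, d_k)
    V_r = E @ V                                          # (r, d_v)
    A = softmax_rows(Q @ K_r.T / np.sqrt(Q.shape[-1]))   # (n, r) instead of (n, n)
    return A @ V_r, A


rng = np.random.default_rng(0)
n, r, d_k, d_v = 1024, 32, 16, 16                        # illustrative sizes, r << n
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
E = rng.standard_normal((r, n)) / np.sqrt(n)             # random stand-in for a learned projection

out, A = low_rank_attention(Q, K, V, E)
print(A.shape, out.shape)
print(n * n, n * r)                                      # score counts: full vs. low-rank
```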

Method | Compression | Trade-off
Head pruning | Remove 30-50% of heads | ~1% accuracy loss
Distillation | 6× smaller model | ~3% accuracy loss
Low-rank attention | O(n) instead of O(n^2) | Approximation error on long contexts
Quantization | 16-bit → 4-bit weights | Minimal loss with careful calibration

Cross-Attention: Text → Signal

A text prompt conditions signal generation via cross-attention. See which text tokens each signal position attends to most strongly.

Chapter 7: Mastery

From the "it" problem to generating images from text — attention is the mechanism that unlocked it all. Let's consolidate.

Component | Formula | Purpose
Query/Key/Value | Q = X W_Q, K = X W_K, V = X W_V | Project into comparison space
Scaled dot-product | softmax(Q K^T / √d_k) V | Content-based weighted sum
Multi-head | Concat(head_1, ..., head_H) W_O | Multiple relationship types
Residual + LayerNorm | X + Sublayer(LN(X)) | Gradient flow + stability
FFN | W_2 ReLU(W_1 x) | Per-token nonlinear processing
Cross-attention | Q from signal, K/V from text | Condition on another modality

The Transformer's Impact on EE

EE Domain | Before Transformers | With Transformers
Speech recognition | HMM + CNN/RNN | Whisper (Transformer encoder-decoder)
Audio generation | WaveNet (dilated CNN) | MusicGen (Transformer + cross-attention)
Signal separation | NMF, deep clustering | SepFormer (dual-path attention)
Radar/sonar | CFAR + matched filter | Attention on range-Doppler maps
Channel estimation | LS, MMSE | Transformer on pilot symbols

Computational Cost: The Quadratic Bottleneck

Self-attention is O(n^2) in sequence length (every token attends to every other). For n = 100K tokens, that's 10 billion attention scores per head, per layer. Active research: linear attention, sparse attention, sliding window attention, and state-space models (Mamba) aim to maintain the benefits while reducing the cost to O(n) or O(n log n).
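
The quadratic growth is easy to tabulate. The sketch below counts scores for a single head in a single layer and, as an added assumption, estimates memory at 2 bytes per score (16-bit floats) if the full matrix were stored; the function names are illustrative.

```python
def attention_score_count(n: int) -> int:
    """Number of query-key scores for one head in one layer."""
    return n * n


def score_memory_gib(n: int, bytes_per_score: int = 2) -> float:
    """Memory to hold the full n x n score matrix, in GiB (16-bit floats assumed)."""
    return attention_score_count(n) * bytes_per_score / 2**30


for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7,}  scores={attention_score_count(n):>14,}  "
          f"memory={score_memory_gib(n):8.3f} GiB")
```

Multiplying n by 10 multiplies both columns by 100, which is why long contexts motivate the sub-quadratic variants listed above.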

Related lessons.
Lecture 23: Neural Networks — foundations of forward/backward pass
Lecture 24: Deep Learning & CNNs — residual connections, depth
Lecture 8: STFT — spectrograms that Transformers process

"Attention is all you need." — Vaswani et al., 2017