EE269 Lecture 25 — Attention & Transformers

Chapter 0: Why Attention?

Consider this sentence: "The cat sat on the mat because it was tired." What does "it" refer to? Obviously the cat, not the mat. You knew this instantly. But how?

Your brain didn't process each word with a fixed-size filter sliding across the sentence (that's what a CNN would do). Instead, you looked back at relevant words — "cat" is far from "it" in the sequence, yet you connected them effortlessly. This ability to dynamically focus on relevant parts of the input, regardless of distance, is attention.

The Limitations of CNNs and RNNs

CNNs: Each layer has a fixed receptive field (e.g., 3 tokens). Each additional 3-wide layer widens the receptive field by only 2 tokens, so about 10 stacked layers are needed before any unit sees two tokens 20 positions apart. Information must propagate through all those layers, which is slow and lossy.

RNNs: Process sequentially, maintaining a hidden state. Token 1 must survive through 19 sequential compression steps to influence token 20. In practice, long-range dependencies get forgotten (even with LSTMs).

Attention: Every token can directly attend to every other token in one operation. No distance penalty. The model learns which connections matter.
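
The CNN depth requirement above can be checked with a little arithmetic. This is a minimal sketch that assumes stride-1, undilated convolutions of kernel width k, for which each layer widens the receptive field by k - 1 tokens; the function names are illustrative, not from the lecture.

```python
import math

def receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field (in tokens) of a stack of stride-1, undilated convolutions."""
    return 1 + num_layers * (kernel - 1)

def layers_to_span(distance: int, kernel: int = 3) -> int:
    """Fewest layers before a single unit sees two tokens `distance` positions apart."""
    # Two tokens `distance` apart occupy a window of distance + 1 tokens,
    # so we need 1 + L * (kernel - 1) >= distance + 1.
    return math.ceil(distance / (kernel - 1))

print(layers_to_span(20))      # 10 layers of 3-wide convolutions
print(receptive_field(10))     # 21 tokens
```

Attention needs a single layer for the same connection, whatever the distance.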

Attention = dynamic, content-based routing. Unlike convolution (fixed pattern, shifted across positions) or fully-connected (fixed weights for fixed positions), attention computes connections based on the content of the input itself. The same layer can connect "it" to "cat" in one sentence and "it" to "house" in another. The routing changes with the input.

Attention Across Three Domains

Domain	Example	What attention does
Text	"it" → "cat"	Resolve pronoun references across long distances
Vision	Dog vs. cat image	Focus on the animal, ignore the background
Audio-visual	Speaker in a crowd	Attend to the face that matches the voice

The "it" Resolution Problem

Click on different words to see what a trained attention model focuses on (highlighted). Notice how "it" attends strongly to "cat" despite being 6 tokens away.

The key advantage of attention over CNNs for sequence modeling is:

• Any token can directly attend to any other token in one step, regardless of distance
• Attention uses fewer parameters than convolution
• Attention is faster to compute than convolution

Chapter 1: Query-Key-Value

The attention mechanism uses three concepts borrowed from database retrieval: queries, keys, and values.

Analogy: You're at a library. You have a question (query). Each book has a title and summary on the spine (key). When your question matches a book's description, you pull it off the shelf and read the content (value). The better the match between your query and the key, the more you rely on that book's value.

Formal Definition

Given an input sequence X = [x₁, ..., x_n]^T (n tokens, each a d-dimensional vector), we compute three matrices:

Q = X W_Q (n × d_k) — queries

K = X W_K (n × d_k) — keys

V = X W_V (n × d_v) — values

where W_Q, W_K ∈ R^(d × d_k) and W_V ∈ R^(d × d_v) are learned projection matrices.

Why three separate projections? The same token plays different roles: "cat" as a query asks "what did I do?"; as a key it advertises "I'm an animate noun"; as a value it contributes its semantic content to other tokens' representations. Separating Q, K, V lets the network learn these distinct roles independently.

Self-Attention: Queries Come From the Same Sequence

In self-attention, queries, keys, and values all come from the same input X. Each token queries every token in the sequence, including itself. This is how "it" can query the whole sentence and find "cat" as the best match.

Data Flow

Input X (n × d)

n tokens, each d-dimensional

↓ × W_Q, W_K, W_V

Q (n × d_k), K (n × d_k), V (n × d_v)

Three projections of the same input

↓ Q K^T

Scores (n × n)

How much each token attends to every other

↓ softmax × V

Output (n × d_v)

Weighted combination of values

Worked Example

Sequence of 3 tokens, d = 4, d_k = d_v = 2. Token embeddings:

X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]

Suppose W_Q projects to queries that capture "what am I looking for?" and W_K captures "what do I contain?". After projection (simplified):

Q = [[1, 0], [0, 1], [1, 1]] K = [[0, 1], [1, 0], [1, 1]]

Score matrix QK^T:

QK^T = [[0, 1, 1], [1, 0, 1], [1, 1, 2]]

Token 3's query [1,1] matches key [1,1] (token 3 itself) with score 2, and matches the others with score 1. After softmax, token 3 attends most strongly to itself.
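
The worked example can be reproduced in a few lines of plain Python. The lecture does not give W_Q and W_K, so the matrices below are hypothetical values chosen only because they map X to the simplified Q and K shown above. The softmax here is applied to the raw scores, as in the paragraph above (scaling is introduced in Chapter 2).

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]   # 3 tokens, d = 4

# Hypothetical projections (not from the lecture), picked to reproduce Q and K.
W_Q = [[1, 0], [0, 1], [0, 0], [0, 0]]
W_K = [[0, 1], [1, 0], [0, 0], [0, 0]]

Q = matmul(X, W_Q)                     # [[1, 0], [0, 1], [1, 1]]
K = matmul(X, W_K)                     # [[0, 1], [1, 0], [1, 1]]
scores = matmul(Q, transpose(K))       # [[0, 1, 1], [1, 0, 1], [1, 1, 2]]

row3 = softmax(scores[2])              # token 3's attention weights
print(scores)
print([round(w, 2) for w in row3])     # the largest weight falls on token 3 itself
```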

Query-Key Matching

Three tokens with their query (arrow) and key (circle) vectors in 2D. The dot product between query and key determines attention strength. Drag tokens to see how scores change.

In self-attention, Q, K, and V are all computed from:

• The same input sequence X, through different learned projections W_Q, W_K, W_V
• Different input sequences (encoder for K,V; decoder for Q)
• Random initializations that are fixed after training

Chapter 2: Scaled Dot-Product Attention

Now we assemble the full attention formula. The output for each query is a weighted sum of values, where weights are determined by query-key similarity:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

Let's break this apart piece by piece.

Step 1: Compute Scores

S = Q K^T (n × n matrix)

Entry S_ij = q_i^T k_j measures how much query i is compatible with key j. Higher dot product = better match = more attention.

Step 2: Scale

S̃ = S / √d_k

Why scale? Without it, for large d_k, the dot products grow in magnitude (variance of q^Tk is proportional to d_k if q and k have unit variance entries). Large values push softmax into saturation, where gradients are tiny. Dividing by √d_k keeps the variance at 1, maintaining softmax in its useful range.

The scaling factor demystified. If each component of q and k is drawn independently with mean 0 and variance 1, then E[q^Tk] = 0 and Var(q^Tk) = d_k. After dividing by √d_k, the variance becomes 1. This keeps softmax inputs in the range where it produces non-degenerate probability distributions (not all mass on one token).
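
The variance claim is easy to check empirically. The sketch below draws q and k with independent standard-normal entries (the assumption stated above) and estimates Var(q^T k) with and without the 1/√d_k factor; the sample count and seed are arbitrary choices.

```python
import random
import statistics

def dot_variance(d_k: int, samples: int = 4000, scaled: bool = False, seed: int = 0) -> float:
    """Sample variance of q.k for q, k with i.i.d. standard-normal entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        dots.append(dot / d_k ** 0.5 if scaled else dot)
    return statistics.variance(dots)

# Columns: d_k, variance of the raw dot product, variance after scaling.
for d_k in (4, 16, 64):
    print(d_k, round(dot_variance(d_k), 1), round(dot_variance(d_k, scaled=True), 2))
```

The middle column should land close to d_k and the last column close to 1, up to sampling noise.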

Step 3: Softmax

A = softmax(S̃) (applied row-wise)

Each row of A is a probability distribution: A_ij = exp(S̃_ij) / ∑_j' exp(S̃_ij'). Row i tells us how token i distributes its attention across all tokens.
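
A minimal sketch of the row-wise softmax, using the common max-subtraction trick for numerical stability (an implementation detail, not something the formula above requires). The second call multiplies the same scores by 8, which is roughly what skipping the 1/√d_k scaling does when d_k = 64, to show how the distribution collapses onto one token. The score values are made up for illustration.

```python
import math

def softmax_rows(scores):
    """Apply softmax to each row of a matrix stored as a list of rows."""
    out = []
    for row in scores:
        m = max(row)                               # largest exponent becomes exp(0) = 1
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scaled = [[1.0, 0.5, -0.5]]                        # illustrative scaled scores
unscaled = [[8 * v for v in scaled[0]]]            # same scores without 1/sqrt(64)

print([round(p, 3) for p in softmax_rows(scaled)[0]])     # spread over several tokens
print([round(p, 3) for p in softmax_rows(unscaled)[0]])   # nearly one-hot
```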

Step 4: Weighted Sum of Values

Output = A V (n × d_v)

Output token i is ∑_j A_ij V_j — a convex combination of all value vectors, weighted by attention.

Complete Numerical Example

Three tokens, d_k = 2, d_v = 2.

Q = [[1, 0], [0, 1], [1, 1]] K = [[1, 1], [0, 1], [1, 0]] V = [[1, 2], [3, 4], [5, 6]]

Scores: QK^T = [[1, 0, 1], [1, 1, 0], [2, 1, 1]]

Scale: ÷ √2 = [[0.71, 0, 0.71], [0.71, 0.71, 0], [1.41, 0.71, 0.71]]

Softmax (row-wise):

A₁ = [0.40, 0.20, 0.40]

A₂ = [0.40, 0.40, 0.20]

A₃ = [0.50, 0.25, 0.25]

Output: O₁ = 0.40[1,2] + 0.20[3,4] + 0.40[5,6] = [3.00, 4.00]. Token 1 attends equally to tokens 1 and 3 (both had score 0.71), drawing from both their values.
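
The four steps can be put together in one short function. This is a plain-Python sketch for small matrices (a real implementation would use an array library); it runs the numerical example above and returns both the output and the attention weights.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)                                   # subtract the max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]     # each row sums to 1
    return matmul(weights, V), weights

Q = [[1, 0], [0, 1], [1, 1]]
K = [[1, 1], [0, 1], [1, 0]]
V = [[1, 2], [3, 4], [5, 6]]

output, weights = attention(Q, K, V)
print([round(v, 2) for v in output[0]])            # [3.0, 4.0]
```

Because each row of weights sums to 1, every output row lies inside the convex hull of the value vectors.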

Attention Score Computation

See the full attention computation step by step. Adjust the scaling factor to see why √d_k matters: without it, softmax saturates.

The scaling by 1/√d_k in attention is needed because:

• It reduces the number of parameters
• Without it, dot products grow with d_k, pushing softmax into saturation where gradients vanish
• It makes the computation faster

Chapter 3: Multi-Head Attention

A single attention head computes one set of attention weights. But language (and signals) have multiple types of relationships simultaneously. In "The cat sat on the mat because it was tired":

Relationship type	Connection
Coreference	"it" → "cat"
Spatial	"sat" → "mat"
Causal	"tired" → "because"
Subject-verb	"cat" → "sat"

One attention pattern can't capture all these simultaneously. Solution: run multiple attention heads in parallel, each with its own W_Q, W_K, W_V projections.

The Multi-Head Formula

For H heads, each head h uses its own projections:

head_h = Attention(X W_Q^h, X W_K^h, X W_V^h)

Each projection maps to dimension d_k = d_v = d/H, so the total parameter count and compute are about the same as for a single head of full dimension d.

Concatenate all heads and project back:

MultiHead(X) = [head₁; head₂; ...; head_H] W_O

where W_O ∈ R^(d × d) mixes information across heads.
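
The multi-head formula can be sketched as follows. The weights are random stand-ins for learned parameters, and the toy sizes (n = 3, d = 8, H = 2) are assumptions chosen to keep the example small; the point is the data flow: project per head, attend, concatenate, mix with W_O.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights: small random values."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # Concatenate along the feature axis: n x (H * d_v) = n x d.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n, d, H = 3, 8, 2                                  # toy sizes, so d_k = d_v = 4
X = random_matrix(n, d, rng)
heads = [tuple(random_matrix(d, d // H, rng) for _ in range(3)) for _ in range(H)]
W_O = random_matrix(d, d, rng)

out = multi_head(X, heads, W_O)
print(len(out), len(out[0]))                       # 3 8: same shape as X
```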

Parameter Count

For model dimension d = 512 with H = 8 heads:

Component	Shape	Parameters
W_Q^h (per head)	512 × 64	32,768
W_K^h (per head)	512 × 64	32,768
W_V^h (per head)	512 × 64	32,768
All 8 heads (Q+K+V)		8 × 3 × 32,768 = 786,432
W_O	512 × 512	262,144
Total		1,048,576 (~1.05M) parameters
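
The table's arithmetic generalizes to any d and H. A small sketch, assuming (as the table does) that the projections have no bias terms and that d is divisible by H:

```python
def attention_params(d: int, num_heads: int) -> int:
    """Weights in multi-head attention, ignoring biases as the table above does."""
    d_head = d // num_heads
    per_head = 3 * d * d_head          # W_Q, W_K, W_V for one head
    w_o = d * d                        # output projection
    return num_heads * per_head + w_o

print(attention_params(512, 8))        # 1048576
print(attention_params(512, 1))        # 1048576: the count does not depend on H
```

Since each head gets d/H dimensions, the total is always 4d², however the model dimension is split.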

Heads specialize. After training, different heads learn to attend to different linguistic/signal properties. In language models, researchers have found heads that track: subject-verb agreement, positional proximity, rare word detection, and syntactic arcs. Each head "looks at the data differently" through its learned projections.

Multi-Head Attention Visualization

Four heads attend to the same sentence. Each head learns a different attention pattern. Click a head to highlight its connections.

Multi-head attention with H=8 heads and model dimension d=512 uses per-head dimension:

• d_k = d/H = 512/8 = 64, so total computation matches a single full-dimensional head
• d_k = 512 per head (H × d total, much more expensive)
• d_k = 8 per head

Chapter 4: The Transformer Block

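The Transformer (Vaswani et al., 2017) is more than attention: it is a specific architecture that wraps multi-head attention in a block with feed-forward networks, residual connections, and layer normalization. This block is then stacked L times. The block shown here is the pre-norm variant (LayerNorm before each sublayer), which is common in recent models; the original paper applied LayerNorm after each residual addition.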

One Transformer Block

Input X (n × d)

Token embeddings + positional encoding

↓

Layer Norm

Normalize each token independently

↓

Multi-Head Self-Attention

Each token attends to all others

↓ + X (residual)

Layer Norm

Normalize again

↓

Feed-Forward Network (FFN)

Two linear layers with ReLU: d → 4d → d

↓ + X' (residual)

Output (n × d)

Same shape as input — blocks can stack

Why Each Component Exists

Multi-head attention: Lets tokens exchange information. Without this, each token only knows about itself.

Feed-forward network (FFN): Processes each token independently (same FFN applied to each position). This is where per-token computation happens — think of it as "thinking about what I've gathered from attention." The expansion to 4d and back allows the network to compute complex functions of the attended information.

Residual connections: Same purpose as in ResNets. They keep gradients flowing and make the identity mapping easy to learn. Without them, stacks as deep as GPT-3's 96 layers would be very difficult to train.

Layer normalization: Like BatchNorm but computed per-token (across the d-dimensional feature, not across the batch). Stabilizes training without depending on batch size.

The Transformer's superpower is composability. Because input and output have the same shape (n × d), blocks stack trivially. Each additional block lets tokens gather information from tokens that have already gathered information, which enables multi-hop reasoning. GPT-3 stacks 96 blocks (GPT-4's depth is unpublished; ~120 is an unofficial estimate), and each one refines the representation.

Complete Equations for One Block

X' = X + MultiHead(LayerNorm(X))

Output = X' + FFN(LayerNorm(X'))

FFN(z) = W₂ · ReLU(W₁ z + b₁) + b₂

where W₁ ∈ R^4d×d and W₂ ∈ R^d×4d.
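
The second equation (the FFN sublayer) can be sketched in plain Python. Assumptions: the LayerNorm's learned gain and bias are left out (equivalent to gain 1, bias 0), the weights are random stand-ins, and the input is a random matrix standing in for the attention sublayer's output X'. The last line checks the position-wise property: each token is processed independently with the same weights.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance.
    The learned gain and bias are omitted (gain 1, bias 0)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ffn(z, W1, b1, W2, b2):
    """FFN(z) = W2 . ReLU(W1 z + b1) + b2 for a single token vector z."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z)) + b) for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def ffn_sublayer(X, params):
    """Output = X' + FFN(LayerNorm(X')), applied to each token independently."""
    return [[xi + fi for xi, fi in zip(x, ffn(layer_norm(x), *params))] for x in X]

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

d = 4
params = (rand(4 * d, d), [0.0] * (4 * d), rand(d, 4 * d), [0.0] * d)   # W1, b1, W2, b2

X_prime = rand(3, d)                               # stand-in for the attention output X'
out = ffn_sublayer(X_prime, params)
print(len(out), len(out[0]))                       # 3 4: same shape as the input

# Position-wise: reversing the tokens just reverses the outputs.
print(ffn_sublayer(X_prime[::-1], params) == out[::-1])   # True
```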

Parameter Count (One Block, d=512, H=8)

Component	Parameters
Multi-head attention (Q, K, V, O)	~1.05M
FFN (W₁, b₁, W₂, b₂)	512×2048 + 2048 + 2048×512 + 512 = ~2.1M
Layer norms (2 ×)	2 × 2 × 512 = 2,048
Total per block	~3.15M
6-layer Transformer	~19M
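
The per-block total can be recomputed for any model dimension. This sketch follows the table's accounting (attention weights without biases, FFN with biases, two LayerNorms with gain and bias) and assumes the 4× FFN expansion used above; embeddings and any output layer are not counted.

```python
def block_params(d: int, ffn_mult: int = 4) -> dict:
    """Parameter count for one Transformer block, following the table's accounting."""
    d_ff = ffn_mult * d
    counts = {
        "attention": 4 * d * d,                     # W_Q, W_K, W_V (all heads) and W_O
        "ffn": d * d_ff + d_ff + d_ff * d + d,      # W1, b1, W2, b2
        "layer_norms": 2 * 2 * d,                   # gain and bias, twice
    }
    counts["total"] = sum(counts.values())
    return counts

one_block = block_params(512)
print(one_block["total"])        # 3150336
print(6 * one_block["total"])    # 18902016, about 19M
```

Roughly two thirds of each block's parameters sit in the FFN, not in attention.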

Interactive Transformer Block

Click each component to see data flow through a Transformer block. Watch how the residual connections preserve information while attention and FFN add to it.

The feed-forward network in a Transformer block processes:

• All tokens jointly (mixing information across positions)
• Each token independently with the same weights (position-wise)
• Only the output of the attention layer, not the residual

Chapter 5: Signal Processing Applications

The Transformer was invented for machine translation (2017), but its architecture is fundamentally about processing sequences, and signals are sequences. Within a few years, Transformer-based models reached state-of-the-art results in speech recognition, image classification, and other signal processing tasks.

Audio: Speech Recognition & Enhancement

Traditional pipeline: audio → STFT → mel filterbank → CNN/RNN features → decoder. A Transformer can replace most of the CNN/RNN stack:

Input: sequence of mel-spectrogram frames (each frame is a "token"). Self-attention lets frame 50 attend directly to frame 5 — capturing long-range dependencies like repeated words or prosodic patterns across an entire utterance.

Whisper (OpenAI, 2022): Transformer encoder-decoder trained on 680K hours of audio. The encoder uses self-attention over spectrogram frames. The decoder uses cross-attention to attend to encoder outputs while generating text token by token.

Music: Source Separation

Separating vocals from instruments requires understanding long-range structure (verse patterns, repetitions). Attention captures this: the vocal pattern at measure 5 should be similar to measure 13 (same melody). A CNN would need very deep networks to see across 8 measures.

Images: Vision Transformers (ViT)

Split an image into patches of 16×16 pixels. Flatten each patch into a vector. Treat patches as "tokens" and apply a standard Transformer. Self-attention lets distant patches communicate directly: a patch in the top-left can attend to a patch in the bottom-right without any pooling hierarchy.
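
The tokenization step can be sketched as below. For brevity this assumes a single-channel image stored as a list of rows whose height and width are multiples of the patch size; the function name and the toy 32×32 image are illustrative.

```python
def patchify(image, patch=16):
    """Split an H x W image (list of rows) into flattened patch x patch tokens.
    Assumes a single channel and that H and W are multiples of `patch`."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

image = [[r * 32 + c for c in range(32)] for r in range(32)]   # toy 32 x 32 image
tokens = patchify(image)
print(len(tokens), len(tokens[0]))     # 4 256: four tokens of 16*16 values each
```

In ViT, each flattened patch is then passed through a learned linear projection to dimension d and given a positional embedding before entering the Transformer blocks.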

Time Series Forecasting

For sensor signals, financial data, or EEG: each time step is a token. Attention learns which past time steps are most predictive of the future. Unlike an AR model (fixed window), attention can learn to attend to periodic events (e.g., heartbeats) regardless of variable spacing.

The common thread. In every domain, the same pattern emerges: (1) Tokenize the signal into a sequence. (2) Apply self-attention to let each token gather relevant context from the entire sequence. (3) Stack blocks to build hierarchical understanding. The same architecture handles text, audio, images, and sensor data with minimal domain-specific changes.

Attention on a Time Series

A periodic signal with anomalies. The attention weights show which past time steps are most relevant for predicting the highlighted position. Notice it attends to the same phase of previous cycles.

The key advantage of Transformers over CNNs for audio signals is:

• Any frame can directly attend to any other frame in one layer, capturing long-range dependencies (repeated words, prosody) without needing many stacked layers
• Transformers use fewer parameters than CNNs
• Transformers are faster to train on audio data

Chapter 6: Text-Conditioned Diffusion & Model Compression

Two frontier applications that build on attention: (1) generating signals from text descriptions, and (2) making large models smaller without losing quality.

Text-Conditioned Diffusion

Diffusion models generate data by learning to reverse a noise-adding process. Starting from pure noise, the model predicts and removes noise step by step until a clean signal (image, audio) emerges.

How does text conditioning work? Cross-attention. The diffusion model's intermediate features are the queries; the text embedding (from a separate text encoder) provides keys and values:

Q = Z W_Q (from noisy signal features)

K = T W_K, V = T W_V (from text embedding T)

CrossAttention = softmax(QK^T / √d_k) V

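Each spatial position in the generated image attends to relevant words in the text prompt. For "A red car on a beach", positions that become the car attend mostly to "red car," while background positions attend to "beach." Stable Diffusion and related text-to-image models condition their denoiser on the prompt this way. MusicGen is an autoregressive Transformer rather than a diffusion model, but it uses the same cross-attention mechanism to condition on text.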
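
The cross-attention equations above differ from self-attention only in where Q, K, and V come from. A minimal sketch with random stand-in weights and hypothetical toy sizes (6 signal positions, 4 text tokens); note that the two inputs may have different lengths and feature dimensions.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Z, T, W_Q, W_K, W_V):
    """Queries from signal features Z (m x d_z); keys and values from text T (n x d_t)."""
    Q, K, V = matmul(Z, W_Q), matmul(T, W_K), matmul(T, W_V)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V), weights             # (m x d_v), (m x n)

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

m, n, d_z, d_t, d_k, d_v = 6, 4, 8, 5, 3, 8        # toy sizes, all hypothetical
Z, T = rand(m, d_z), rand(n, d_t)
out, weights = cross_attention(Z, T, rand(d_z, d_k), rand(d_t, d_k), rand(d_t, d_v))

print(len(out), len(out[0]))                       # 6 8: one output per signal position
print(len(weights), len(weights[0]))               # 6 4: one weight per text token
```

Row i of the weight matrix is the distribution over prompt tokens for signal position i.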

Cross-attention = the bridge between modalities. Text and images live in different spaces. Cross-attention lets one modality's features (Q) query another modality's representations (K, V). The same mechanism works for audio-text (MusicGen), video-text (Sora), and any cross-modal generation.

Model Compression

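Large Transformers are expensive to deploy (GPT-4 is reported, though not officially confirmed, to have ~1.8T parameters). Three compression strategies leverage the structure of attention; a fourth, quantization, applies to the weights of any network: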

Pruning attention heads. Not all heads are useful after training. Many can be removed with minimal quality loss. Some heads are redundant; others attend to trivial patterns (always attend to the previous token).

Knowledge distillation. Train a small "student" Transformer to match the attention patterns and outputs of a large "teacher." The student learns which attention patterns matter most.

Low-rank factorization. The attention matrix A = softmax(QK^T/√d_k) is n×n but often has effective rank much lower than n. Low-rank approximations reduce computation from O(n²) to O(n·r) where r « n.

Method	Compression	Typical trade-off (task-dependent)
Head pruning	Remove 30-50% of heads	~1% accuracy loss
Distillation	6× smaller model	~3% accuracy loss
Low-rank attention	O(n·r) instead of O(n²)	Approximation error on long contexts
Quantization	16-bit → 4-bit weights	Minimal loss with careful calibration

Cross-Attention: Text → Signal

A text prompt conditions signal generation via cross-attention. See which text tokens each signal position attends to most strongly.

In text-conditioned diffusion (e.g., Stable Diffusion), the text influences the generated image through:

• Cross-attention, where image features are queries and text embeddings provide keys and values
• Concatenating text and image into one sequence for self-attention
• Adding text embedding directly to the noise

Chapter 7: Mastery

From the "it" problem to generating images from text — attention is the mechanism that unlocked it all. Let's consolidate.

Component	Formula	Purpose
Query/Key/Value	Q = XW_Q, K = XW_K, V = XW_V	Project into comparison space
Scaled dot-product	softmax(QK^T/√d_k)V	Content-based weighted sum
Multi-head	Concat(head₁,...,head_H)W_O	Multiple relationship types
Residual + LayerNorm	X + Sublayer(LN(X))	Gradient flow + stability
FFN	W₂ ReLU(W₁x + b₁) + b₂	Per-token nonlinear processing
Cross-attention	Q from signal, K/V from text	Condition on another modality

The Transformer's Impact on EE

EE Domain	Before Transformers	With Transformers
Speech recognition	HMM + CNN/RNN	Whisper (Transformer encoder-decoder)
Audio generation	WaveNet (dilated CNN)	MusicGen (Transformer + cross-attention)
Signal separation	NMF, deep clustering	SepFormer (dual-path attention)
Radar/sonar	CFAR + matched filter	Attention on range-Doppler maps
Channel estimation	LS, MMSE	Transformer on pilot symbols

Computational Cost: The Quadratic Bottleneck

Self-attention is O(n²) in sequence length (every token attends to every other). For n = 100K tokens, that's 10 billion attention scores per head, per layer. Active research: linear attention, sparse attention, sliding window attention, and state-space models (Mamba) aim to keep the benefits while reducing the cost to O(n) or O(n log n).
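
The quadratic growth is easy to tabulate. The sketch below counts the entries of one n × n attention matrix and the memory needed to store it, assuming 16-bit scores and that the full matrix is materialized (memory-efficient kernels avoid this, but the number of score computations stays quadratic).

```python
def attention_score_count(n: int) -> int:
    """Entries in one n x n attention matrix (one head, one layer)."""
    return n * n

def score_memory_gb(n: int, bytes_per_score: int = 2) -> float:
    """Memory to materialize that matrix, assuming 16-bit scores by default."""
    return attention_score_count(n) * bytes_per_score / 1e9

# Columns: sequence length, number of scores, gigabytes for one matrix.
for n in (1_000, 10_000, 100_000):
    print(n, attention_score_count(n), score_memory_gb(n))
```

Going from 10K to 100K tokens multiplies the score count by 100, from 10⁸ to 10¹⁰.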

Related lessons.
• Lecture 23: Neural Networks — foundations of forward/backward pass
• Lecture 24: Deep Learning & CNNs — residual connections, depth
• Lecture 8: STFT — spectrograms that Transformers process

"Attention is all you need." — Vaswani et al., 2017