Workbook — NLP Fundamentals (CS224N)

NLP Fundamentals

From word vectors to text classifiers — the core NLP toolkit derived by hand. Word2Vec, parsing, attention, tokenization, pretraining, fine-tuning, and evaluation — all solvable in-browser with instant feedback.

Prerequisites: Basic linear algebra (dot products, matrix shapes) + Probability (softmax, log likelihood). That's it.
10
Chapters
57
Exercises
5
Exercise Types
Mastery
0 / 57 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: Word Vectors

How do you tell a computer that "king" and "queen" are related but "king" and "toaster" are not? You can't just alphabetize — "cat" and "car" are close alphabetically but worlds apart semantically. The insight behind word vectors is to represent each word as a point in a high-dimensional space where distance equals meaning.

In Word2Vec (skipgram), you train a model to predict context words from a center word. The key idea: if two words appear in similar contexts, they end up with similar vectors. The training window defines "context" — for window size w, context words are the w words before and after the center word.

Skipgram context extraction:
Given sentence: w1 w2 ... wn
Center word: wc
Context (window=m): { wc-m, ..., wc-1, wc+1, ..., wc+m }

Similarity:
Dot product: u · v = ∑i uivi
Cosine similarity: cos(θ) = (u · v) / (||u|| · ||v||)
The distributional hypothesis. "You shall know a word by the company it keeps" (Firth, 1957). Word2Vec turns this linguistic insight into geometry: words that share contexts become neighbors in vector space. This is why king - man + woman ≈ queen works — the "royalty" direction is orthogonal to the "gender" direction.
Exercise 0.1: Context Words Trace
Sentence: "the cat sat on the mat". Center word: "sat" (position 3, 1-indexed). Window size = 2. Which words are the context words?
Show explanation

With window=2, we take 2 words to the left and 2 words to the right of the center word "sat" (position 3).

Left context: positions 1,2 → "the", "cat"
Right context: positions 4,5 → "on", "the"
Context = {the, cat, on, the}

Note: the center word is never included in its own context. Also note "the" appears twice — once from the left, once from the right. In practice Word2Vec treats these as separate training pairs: (sat, the), (sat, cat), (sat, on), (sat, the).

Exercise 0.2: Dot Product Similarity Derive

Given vectors: vcat = [0.5, 0.8, -0.2], vdog = [0.4, 0.9, -0.1]. Compute the dot product vcat · vdog.

(scalar)
Show derivation
vcat · vdog = (0.5)(0.4) + (0.8)(0.9) + (-0.2)(-0.1)
= 0.20 + 0.72 + 0.02 = 0.94

A positive dot product means the vectors point in roughly the same direction — the words are semantically similar. Compare this to vcat · vtoaster which would be near zero for unrelated words.

Exercise 0.3: Cosine Similarity Derive

Using the same vectors: vcat = [0.5, 0.8, -0.2], vdog = [0.4, 0.9, -0.1]. Compute the cosine similarity (round to 2 decimal places).

Compute ||vcat|| = √(0.25 + 0.64 + 0.04), ||vdog|| = √(0.16 + 0.81 + 0.01), then divide dot product by the product of norms.

(between -1 and 1)
Show derivation
||vcat|| = √(0.25 + 0.64 + 0.04) = √0.93 ≈ 0.9644
||vdog|| = √(0.16 + 0.81 + 0.01) = √0.98 ≈ 0.9899
cos(θ) = 0.94 / (0.9644 × 0.9899) = 0.94 / 0.9547 ≈ 0.984

Cosine similarity of 0.98 — extremely similar! Cosine similarity normalizes for vector length, so it measures direction only. Two vectors with cosine similarity = 1.0 point the same way regardless of magnitude. This is why NLP prefers cosine over raw dot product for comparing embeddings.

Exercise 0.4: Implement cosineSimilarity() Build

Write a function that takes two arrays (vectors) of equal length and returns their cosine similarity.

Return a single number between -1 and 1.
Show solution
javascript
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
Exercise 0.5: Analogy Arithmetic Derive

Given: vking = [0.8, 0.6, 0.9], vman = [0.3, 0.1, 0.7], vwoman = [0.3, 0.5, 0.7]. Compute vresult = vking - vman + vwoman. What is the first component of vresult?

(first component)
Show derivation
vresult = [0.8, 0.6, 0.9] - [0.3, 0.1, 0.7] + [0.3, 0.5, 0.7]
= [0.8-0.3+0.3, 0.6-0.1+0.5, 0.9-0.7+0.7]
= [0.8, 1.0, 0.9]

The result [0.8, 1.0, 0.9] should be closest to vqueen in the embedding space. Notice the first component stayed at 0.8 (the "royalty" axis), while the second component jumped from 0.6 to 1.0 (the "femininity" axis). This is the geometric magic of word vector arithmetic: king - man + woman ≈ queen.

Exercise 0.6: Training Pairs Derive

Sentence: "I love deep learning models". Window size = 2. How many total (center, context) training pairs does Word2Vec skipgram generate?

Count each (center, context) pair. Edge words have fewer context words since the window gets clipped at sentence boundaries.

training pairs
Show derivation

For each word, count how many context words are within the window:

"I" (pos 1): right only = {love, deep} → 2 pairs
"love" (pos 2): left = {I}, right = {deep, learning} → 3 pairs
"deep" (pos 3): left = {I, love}, right = {learning, models} → 4 pairs
"learning" (pos 4): left = {love, deep}, right = {models} → 3 pairs
"models" (pos 5): left only = {deep, learning} → 2 pairs
Total = 2 + 3 + 4 + 3 + 2 = 14 pairs

"love" gets only 3 context words (not 4) because the window clips at the left boundary — there's only 1 word to its left. Edge effects reduce total pairs. The general formula: for a sentence of length n with window m, each word at position i produces min(m, i-1) + min(m, n-i) pairs.

Chapter 1: Word2Vec Gradients

Word2Vec learns by gradient descent, so we need to understand the objective function and its gradients. The skipgram model defines the probability of observing a context word o given center word c using the softmax over the entire vocabulary:

Skipgram probability (softmax):
P(o | c) = exp(uo · vc) / ∑w∈V exp(uw · vc)

Loss for one (center, context) pair:
J = -log P(o | c) = -(uo · vc) + log ∑w∈V exp(uw · vc)

Gradient w.r.t. vc:
∂J / ∂vc = -uo + ∑w∈V P(w|c) · uw
// observed context vector minus expected context vector
The gradient has a beautiful interpretation. ∂J/∂vc = -(uo - E[uw]). The gradient pushes vc toward the observed context word uo and away from the expected (probability-weighted average) context word. Training = making the center vector more like its actual neighbors.
Exercise 1.1: Softmax Probability Derive

Vocabulary: {a, b, c, d, e}. Center word "c" with vc = [1, 0]. Context vectors: ua=[1,1], ub=[0,1], uc=[1,0], ud=[-1,0], ue=[0,-1]. Compute P(a | c).

First compute all dot products uw · vc, then apply softmax.

(probability)
Show derivation
ua · vc = [1,1] · [1,0] = 1
ub · vc = [0,1] · [1,0] = 0
uc · vc = [1,0] · [1,0] = 1
ud · vc = [-1,0] · [1,0] = -1
ue · vc = [0,-1] · [1,0] = 0
exp values: e1=2.718, e0=1, e1=2.718, e-1=0.368, e0=1
∑ = 2.718 + 1 + 2.718 + 0.368 + 1 = 7.804
P(a|c) = 2.718 / 7.804 ≈ 0.348

Words with higher dot product to the center word get higher probability. ua and uc both have dot product 1, so they share the highest probability. ud has dot product -1, so it gets the lowest probability (0.047). This is exactly what we want: the model says "a" and "c" are likely context words for center word "c".

Exercise 1.2: The Softmax Bottleneck Trace
With a vocabulary of V = 50,000 words, computing the exact softmax for one training pair requires how many dot products?
Show explanation

The denominator ∑w∈V exp(uw · vc) requires computing the dot product of vc with every context vector uw in the vocabulary. For V = 50,000, that's 50,000 dot products per training pair. With billions of training pairs, this is prohibitively expensive. This is the softmax bottleneck — and it's exactly why negative sampling was invented.

Exercise 1.3: Negative Sampling Loss Derive

Negative sampling replaces the full softmax with a binary classification. Instead of normalizing over V words, we maximize log σ(uo · vc) + ∑k log σ(-uk · vc) for k negative samples. Given: uo · vc = 2.0, and 2 negative samples with uk1 · vc = -0.5, uk2 · vc = 0.3. Compute the negative sampling loss JNS = -[log σ(2.0) + log σ(0.5) + log σ(-0.3)].

σ(x) = 1/(1+e-x). So σ(2.0)≈0.881, σ(0.5)≈0.622, σ(-0.3)≈0.426.

(loss value)
Show derivation

Positive term: σ(uo · vc) = σ(2.0). Negative terms use σ(-uk · vc):

σ(2.0) = 1/(1+e-2) = 1/1.135 ≈ 0.881
σ(-(-0.5)) = σ(0.5) = 1/(1+e-0.5) ≈ 0.622
σ(-(0.3)) = σ(-0.3) = 1/(1+e0.3) ≈ 0.426
JNS = -(ln 0.881 + ln 0.622 + ln 0.426)
= -(-0.127 - 0.476 - 0.853) = 1.456

The key subtlety: negative samples use σ(-uk · vc), which encourages high sigmoid output when the dot product is negative (i.e., the words are dissimilar). With just k=2 negatives instead of V=50,000 terms in the softmax denominator, training is orders of magnitude faster.

Exercise 1.4: Gradient Computation Derive

Using the 5-word vocabulary from Exercise 1.1, with the observed context word being "a": compute the gradient ∂J/∂vc = -uo + ∑w P(w|c) · uw. What is the first component of the gradient vector?

Use the probabilities: P(a|c)=0.348, P(b|c)=0.128, P(c|c)=0.348, P(d|c)=0.047, P(e|c)=0.128. The expected vector E[uw] = ∑ P(w|c) · uw.

(first component of gradient)
Show derivation

First compute E[uw] (first component only):

E[uw]1 = 0.348(1) + 0.128(0) + 0.348(1) + 0.047(-1) + 0.128(0)
= 0.348 + 0 + 0.348 - 0.047 + 0 = 0.649
∂J/∂vc (first component) = -ua,1 + E[uw]1 = -1 + 0.649 = -0.351

The gradient is negative in the first component, meaning the update will increase vc in that direction — pushing vc toward the observed context word ua. The gradient literally says: "move the center vector closer to the word you actually observed."

Exercise 1.5: Negative Sampling Efficiency Trace
Negative sampling with k=15 on a vocabulary of V=100,000. By what factor does negative sampling reduce computation per training step compared to full softmax?
Show explanation
Full softmax: V = 100,000 dot products per step
Negative sampling: 1 positive + 15 negatives = 16 dot products per step
Speedup = 100,000 / 16 = 6,250×

This is why negative sampling made Word2Vec practical. The original paper (Mikolov et al., 2013) reported that k=5-20 negative samples works well for small datasets, while k=2-5 suffices for very large corpora. The quality barely degrades because most words in the vocabulary are irrelevant to any given center word — you only need a small sample of negatives to approximate the full gradient.

Exercise 1.6: Find the Bug Debug

This negative sampling loss implementation has a subtle bug. Click the buggy line.

function negSamplingLoss(vc, uo, negVecs) {
  let loss = -Math.log(sigmoid(dot(uo, vc)));
  for (let k = 0; k < negVecs.length; k++) {
    loss -= Math.log(sigmoid(dot(negVecs[k], vc)));
  }
  return loss;
}
Show explanation

Line 4 is the bug. For negative samples, we want σ(-uk · vc), not σ(uk · vc). The correct line should be: loss -= Math.log(sigmoid(-dot(negVecs[k], vc)));

Without the negation, the loss encourages high similarity with negative samples instead of discouraging it. The model would learn to make all words similar — the exact opposite of what we want.

Chapter 2: Dependency Parsing

How does a computer figure out that in "the cat sat on the mat," the word "sat" is the main action, "cat" is the one doing it, and "mat" is where it happened? Dependency parsing builds a tree that connects each word to the word it modifies. The result is a directed graph where every word except the root has exactly one parent.

The shift-reduce parser uses three operations to build this tree incrementally. It maintains a stack (words being processed), a buffer (words waiting), and a set of arcs (the dependency edges found so far).

Arc-standard transitions:
SHIFT: move the front of the buffer onto the stack
LEFT-ARC(r): the top of stack becomes parent of second-on-stack. Remove second-on-stack.
RIGHT-ARC(r): second-on-stack becomes parent of top-of-stack. Remove top-of-stack.

For a sentence of n words:
Total transitions = 2n (n shifts + n arcs)
Why shift-reduce? A naive parser would try all possible trees — O(n!) combinations. Shift-reduce builds the tree in exactly 2n steps by making greedy local decisions. A neural network scores each possible transition at each step. This is why neural dependency parsing (Chen & Manning, 2014) was a breakthrough: O(n) parsing with neural-network-quality decisions.
Exercise 2.1: Trace the Parser (Step 1) Trace
Sentence: "The cat sat on the mat". Initial state: Stack = [ROOT], Buffer = [The, cat, sat, on, the, mat]. After performing SHIFT twice, what is the stack?
Show trace
Initial: Stack = [ROOT], Buffer = [The, cat, sat, on, the, mat]
SHIFT: Stack = [ROOT, The], Buffer = [cat, sat, on, the, mat]
SHIFT: Stack = [ROOT, The, cat], Buffer = [sat, on, the, mat]

Each SHIFT moves the front word from the buffer onto the top of the stack. The stack grows from left to right — the rightmost element is the "top."

Exercise 2.2: Trace the Parser (LEFT-ARC) Trace
Continuing from the state: Stack = [ROOT, The, cat], Buffer = [sat, on, the, mat]. We perform LEFT-ARC(det) — meaning "cat" is the parent of "The" with label "det". What is the new stack?
Show explanation

LEFT-ARC: the top of stack (cat) becomes the parent of the second-on-stack (The). The second-on-stack (The) is removed. The arc is cat → The (parent → child), labeled "det" for determiner.

Stack before: [ROOT, The, cat]
LEFT-ARC(det): creates arc cat ⟶det The, removes "The"
Stack after: [ROOT, cat]

Think of LEFT-ARC as: "the word below me on the stack depends on me." The arrow points from parent to child: cat → The. This correctly says "The" is a determiner of "cat."

Exercise 2.3: Total Transitions Derive

"The cat sat on the mat" has 6 words. How many total transitions (SHIFT + ARC operations) does the shift-reduce parser need?

transitions
Show derivation
n = 6 words
Total transitions = 2n = 2 × 6 = 12

Every word must be shifted once (6 shifts) and every word must be attached to exactly one parent (6 arcs, one for each word including ROOT attachment). This gives exactly 2n transitions. A neural parser with a scoring network that runs in O(1) per transition thus parses in O(n) total time — linear in sentence length.

Exercise 2.4: Full Parse Trace Design

Put these transitions in the correct order to parse "The cat sat":

?
?
?
?
?
?
SHIFT "The" SHIFT "cat" LEFT-ARC(det) SHIFT "sat" LEFT-ARC(nsubj) RIGHT-ARC(root)
Show explanation

Correct order: SHIFT "The" → SHIFT "cat" → LEFT-ARC(det) → SHIFT "sat" → LEFT-ARC(nsubj) → RIGHT-ARC(root)

1. SHIFT: Stack=[ROOT, The], Buf=[cat, sat]
2. SHIFT: Stack=[ROOT, The, cat], Buf=[sat]
3. LEFT-ARC(det): cat→The, Stack=[ROOT, cat], Buf=[sat]
4. SHIFT: Stack=[ROOT, cat, sat], Buf=[]
5. LEFT-ARC(nsubj): sat→cat, Stack=[ROOT, sat], Buf=[]
6. RIGHT-ARC(root): ROOT→sat, Stack=[ROOT], Buf=[]

The final tree: ROOT → sat, sat → cat (nsubj), cat → The (det). This correctly captures that "sat" is the main verb, "cat" is the subject, and "The" modifies "cat."

Exercise 2.5: Parser Complexity Trace
A sentence has 40 words. A neural shift-reduce parser scores each transition with a feedforward network that takes O(1). A graph-based parser scores all O(n²) possible arcs then finds the best tree. Which is faster for this sentence?
Show explanation

Shift-reduce: O(n) = 80 steps. Graph-based (Eisner/MST): O(n²) or O(n³) = 1,600 or 64,000 steps. Shift-reduce wins on speed, but graph-based can find the globally optimal tree while shift-reduce makes greedy local decisions. In practice, beam search (keeping top-k candidates at each step) helps shift-reduce approach global optimality while staying fast.

Chapter 3: Language Model Perplexity

You've trained two language models and both look reasonable. How do you decide which one is better? You can't just eyeball their generated text — you need a number. Perplexity is that number. It measures how "surprised" the model is by real text. A model that assigns high probability to actual sentences has low perplexity.

Perplexity:
PP(W) = exp( -1/N ∑i=1N log P(wi | w1...wi-1) )

Equivalently:
PP(W) = ( ∏i=1N P(wi | w1...wi-1) )-1/N

Interpretation:
PP = weighted average branching factor
PP = 10 means the model is as uncertain as choosing among 10 equally likely words
Perplexity is the exponential of cross-entropy loss. When you train a language model with cross-entropy loss and get a validation loss of 3.5, the perplexity is exp(3.5) = 33.1. This means the model is, on average, as confused as if it were picking uniformly from ~33 words at each position. Modern LLMs achieve perplexities of 5-15 on typical benchmarks.
Exercise 3.1: Compute Perplexity Derive

A language model assigns these probabilities to a 5-word sentence: P(w1)=0.1, P(w2|w1)=0.3, P(w3|w1:2)=0.5, P(w4|w1:3)=0.2, P(w5|w1:4)=0.4. Compute the perplexity.

perplexity
Show derivation
-1/N ∑ log P = -1/5 [log(0.1) + log(0.3) + log(0.5) + log(0.2) + log(0.4)]
= -1/5 [-2.303 + (-1.204) + (-0.693) + (-1.609) + (-0.916)]
= -1/5 × (-6.725) = 1.345
PP = exp(1.345) ≈ 3.84

Alternative computation via the product formula: (0.1 × 0.3 × 0.5 × 0.2 × 0.4)-1/5 = (0.0012)-0.2 = (1/0.0012)0.2 = 833.30.2 ≈ 3.84. Both methods give the same answer. The model is about as confused as picking from ~4 equally likely options.

Exercise 3.2: Cross-Entropy to Perplexity Derive

Your language model reports a validation cross-entropy loss of 2.8 (nats, i.e. natural log base). What is the perplexity?

perplexity
Show derivation
PP = exp(cross-entropy) = exp(2.8) = e2.8 ≈ 16.44

This is the most common conversion you'll do in practice. Training loss of 2.8 → perplexity of ~16.4. For reference: GPT-2 achieved ~29.4 perplexity on WikiText-103, GPT-3 improved to ~20.5, and current frontier models are in the 5-10 range on similar benchmarks.

Exercise 3.3: Which Model is Better? Trace
Model A has perplexity 22.1 on the test set. Model B has perplexity 15.3. Which model is better, and what does the difference mean concretely?
Show explanation

Lower perplexity = better model. Model B (PP=15.3) assigns higher probability to the test text than Model A (PP=22.1). Concretely: at each word position, Model B is as uncertain as choosing from ~15 options, while Model A is as uncertain as choosing from ~22 options. Model B has a tighter, more accurate distribution over next words.

Caveat: perplexity must be compared on the same test set with the same vocabulary. A model with vocabulary size 1000 will naturally have lower perplexity than one with vocabulary 50000 on the same text, simply because it has fewer choices.

Exercise 3.4: Implement perplexity() Build

Write a function that takes an array of token probabilities and returns the perplexity.

Use Math.log (natural log) and Math.exp.
Show solution
javascript
function perplexity(probs) {
  const N = probs.length;
  let sumLogP = 0;
  for (let i = 0; i < N; i++) {
    sumLogP += Math.log(probs[i]);
  }
  return Math.exp(-sumLogP / N);
}
Exercise 3.5: Perplexity Lower Bound Derive

A uniform language model assigns probability 1/V to every word regardless of context, where V is the vocabulary size. If V = 50,000, what is the perplexity of this model?

perplexity
Show derivation
PP = exp(-1/N ∑ log(1/V)) = exp(-1/N × N × log(1/V))
= exp(-log(1/V)) = exp(log(V)) = V = 50,000

The uniform model has perplexity exactly equal to the vocabulary size. This is the worst-case baseline — any model that learns anything about language should beat this. For V=50,000, the model is maximally confused, picking randomly from all 50,000 words at each position. A perplexity of 15 means the model has reduced its effective choices from 50,000 to 15 — a 3,333× improvement.

Chapter 4: Attention Mechanisms

Before transformers, seq-to-seq models compressed the entire input sentence into a single fixed-length vector. For long sentences, this bottleneck destroyed information. Attention (Bahdanau et al., 2014) lets the decoder look back at all encoder states and focus on the most relevant ones for each output word.

Bahdanau (additive) attention:
eij = vT tanh(W1si + W2hj)  // alignment score
αij = softmax(eij) = exp(eij) / ∑k exp(eik)  // attention weight
ci = ∑j αij hj  // context vector

Luong (multiplicative) attention:
eij = siT W hj  // or just siThj (dot product)
Attention = soft dictionary lookup. Think of the encoder states as key-value pairs and the decoder state as a query. Attention computes a relevance score between the query and each key, normalizes with softmax, and returns a weighted sum of values. This is exactly the Q, K, V framework that transformers generalize.
Exercise 4.1: Alignment Scores Derive

Three encoder states (source words): h1=[1,0], h2=[0,1], h3=[1,1]. One decoder state: s=[0.5, 0.8]. Using dot-product attention (Luong, no W matrix), compute the alignment score e1 = s · h1.

(score)
Show derivation
e1 = s · h1 = [0.5, 0.8] · [1, 0] = 0.5 × 1 + 0.8 × 0 = 0.5
Exercise 4.2: All Scores + Softmax Derive

Using the same vectors, compute all three alignment scores and then the attention weights α1, α2, α3 via softmax. What is α3 (the weight on h3=[1,1])?

(attention weight)
Show derivation
e1 = [0.5,0.8] · [1,0] = 0.5
e2 = [0.5,0.8] · [0,1] = 0.8
e3 = [0.5,0.8] · [1,1] = 1.3
exp(0.5) = 1.649, exp(0.8) = 2.226, exp(1.3) = 3.669
∑ = 1.649 + 2.226 + 3.669 = 7.544
α1 = 1.649/7.544 = 0.219
α2 = 2.226/7.544 = 0.295
α3 = 3.669/7.544 = 0.486

h3 gets the highest attention weight (~49%) because it has the largest dot product with the decoder state. This makes sense: h3=[1,1] is the most aligned with s=[0.5,0.8] — it's the source word the decoder "wants" to look at most.

Exercise 4.3: Context Vector Derive

Using the attention weights from 4.2 (α1≈0.219, α2≈0.295, α3≈0.486), compute the context vector c = ∑ αjhj. What is the first component of c?

(first component)
Show derivation
c1 = 0.219 × 1 + 0.295 × 0 + 0.486 × 1 = 0.705
c2 = 0.219 × 0 + 0.295 × 1 + 0.486 × 1 = 0.781
c = [0.705, 0.781]

The context vector is a weighted average of the encoder states. Since h3 has the highest weight, c is pulled strongly toward h3=[1,1]. This vector summarizes "what the decoder should know about the input" for generating the current output word.

Exercise 4.4: Bahdanau vs Luong Trace
Bahdanau attention uses an additive score e = vTtanh(W1s + W2h) with learnable W1, W2, v. Luong attention uses e = sTWh (or just sTh for dot). Which statement is correct?
Show explanation

Bahdanau uses a tanh nonlinearity, so it can capture non-linear relationships between query and key. But this prevents efficient batching — you can't compute all scores with a single matrix multiply. Luong's dot product can be computed as S × HT, a single GEMM call that GPUs love. This is why transformers chose scaled dot-product attention: QKT/√dk is just Luong attention with a scaling factor to prevent softmax saturation.

Exercise 4.5: Which Source Word Gets Attention? Trace
You're translating "Le chat noir" (The black cat) to English. The decoder is generating the word "cat". Encoder states correspond to "Le" (the), "chat" (cat), "noir" (black). Which encoder state should get the highest attention weight?
Show explanation

"chat" is the French word for "cat," so it should receive the highest attention weight when generating "cat." This is exactly what attention was designed for: it learns to align source and target words. In French-to-English translation, attention handles word reordering naturally — "Le chat noir" becomes "The black cat" by attending to the right word at each step, regardless of position differences between the languages.

Exercise 4.6: Scaled Dot-Product Derive

Transformers use scaled dot-product: score = q · k / √dk. If dk = 64 and q · k = 16, what is the scaled score?

(scaled score)
Show derivation
score = q · k / √dk = 16 / √64 = 16 / 8 = 2.0

Without scaling, dot products grow proportionally to dk (expected magnitude ≈ √dk for random vectors). Large dot products push softmax into saturated regions where gradients vanish. Dividing by √dk keeps the variance at 1, regardless of dimension. The original "Attention Is All You Need" paper (Vaswani et al., 2017) explains this in Section 3.2.1.

Chapter 5: Subword Tokenization

Word-level tokenization has a fatal flaw: what happens when the model encounters a word it's never seen? Out-of-vocabulary (OOV) words are replaced with a generic [UNK] token, destroying meaning. "Unforgettable" becomes [UNK]. The solution is subword tokenization: break words into smaller pieces that cover all possible words.

Byte-Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent pair. After training, the vocabulary contains a mix of characters, subwords ("un", "##ing"), and full words ("the").

BPE Algorithm:
1. Start with character vocabulary + end-of-word symbol
2. Count all adjacent character pairs in corpus
3. Merge the most frequent pair into a new token
4. Repeat step 2-3 for k merge operations

Vocabulary size after k merges:
|V| = |initial chars| + k
Subword tokenization is a spectrum. Character-level: vocabulary of ~256 but sequences are very long. Word-level: short sequences but huge vocabulary and OOV problems. Subword: a sweet spot. GPT-2 uses ~50K BPE tokens, LLaMA uses ~32K SentencePiece tokens. The vocabulary size directly affects the embedding table size (V × d parameters) and the softmax computation over V classes.
Exercise 5.1: BPE Merge Step 1 Trace
Initial vocabulary: {a, b, c, d, _} (where _ is space/end-of-word). Corpus token frequencies: "a b _" appears 5 times, "a b c _" appears 3 times, "d b c _" appears 2 times. The pair counts are: (a,b)=8, (b,_)=5, (b,c)=5, (c,_)=5, (d,b)=2. Which pair is merged first?
Show explanation

BPE always merges the most frequent pair. (a,b) appears 8 times — more than any other pair. After merging, the new token "ab" replaces all occurrences of "a b" in the corpus. The vocabulary becomes {a, b, c, d, _, ab} with 6 tokens.

Exercise 5.2: Vocabulary Size Derive

You start with a character vocabulary of 256 (all bytes). After performing 32,000 merge operations, what is the final vocabulary size?

tokens
Show derivation
|V| = |initial chars| + k = 256 + 32,000 = 32,256

Each merge creates exactly one new token (the concatenation of the merged pair) without removing the original characters. So the vocabulary grows by exactly 1 per merge. This is how you choose your target vocabulary size: set k = desired_V - initial_chars. LLaMA's 32K vocabulary used approximately 32,000 - 256 = 31,744 merges.

Exercise 5.3: BPE Merge Trace Derive

Corpus: "low lower lowest" with word frequencies: "low"=5, "lower"=2, "lowest"=1. Initial tokens (with end-of-word ¤): l o w ¤ (freq 8), e r ¤ (freq 2), e s t ¤ (freq 1). After 3 merge operations, how many tokens are in the vocabulary? Start from initial character set: {l, o, w, e, r, s, t, ¤} = 8 tokens.

tokens
Show derivation

Count pair frequencies across all words (weighted by word frequency):

(l,o): 5+2+1 = 8, (o,w): 5+2+1 = 8, (w,¤): 5, (w,e): 2+1 = 3
(e,r): 2, (r,¤): 2, (e,s): 1, (s,t): 1, (t,¤): 1
Merge 1: (l,o) → "lo" — freq 8 (tied with (o,w), pick first)
Merge 2: (lo,w) → "low" — now the most frequent pair
Merge 3: (low,¤) → "low¤" — freq 5
Vocabulary = {l, o, w, e, r, s, t, ¤, lo, low, low¤} = 8 + 3 = 11
Exercise 5.4: Why Subword > Word-Level? Trace
A word-level model with V=100,000 words encounters the word "ChatGPT-4o" which is not in its vocabulary. A BPE model with V=32,000 tokens has never seen this exact word either. What happens in each case?
Show explanation

BPE's key advantage is that it cannot produce [UNK]. Any string can be decomposed into known subwords — in the worst case, into individual bytes. The model can still compose meaning from pieces: "Chat" (conversation), "GPT" (common in AI context), "4" (number), "o" (suffix). Word-level models are helpless with OOV words, making them brittle against new terminology, typos, and morphological variation.

Exercise 5.5: Embedding Table Impact Derive

Compare the embedding table size for: (A) word-level with V=100,000, d=768 vs (B) BPE with V=32,000, d=768. How many fewer parameters does BPE use (in millions)?

million fewer params
Show derivation
Word-level: 100,000 × 768 = 76,800,000 = 76.8M
BPE: 32,000 × 768 = 24,576,000 = 24.6M
Difference = 76.8M - 24.6M = 52.2M fewer parameters

52M parameters saved just in the embedding table! For GPT-2 Small (124M total), this is ~42% of the entire model. Smaller vocabulary also speeds up the output softmax layer. The tradeoff: BPE sequences are ~20-30% longer than word-level (e.g., "unforgettable" = 1 word-token vs 3 BPE tokens), increasing compute per sentence.

Chapter 6: Pretraining Objectives

The magic of modern NLP is pretraining: train a huge model on a vast corpus with a self-supervised objective, then fine-tune on your specific task. The two dominant objectives are Masked Language Modeling (MLM) used by BERT and Causal Language Modeling (CLM) used by GPT.

MLM (BERT):
Mask 15% of tokens randomly. Of masked tokens:
  80% → replace with [MASK]
  10% → replace with random token
  10% → keep original
Predict original token at masked positions.

CLM (GPT):
Given tokens w1...wt-1, predict wt.
Loss = -∑t=1T log P(wt | w1...wt-1)
// Every position is a prediction target
MLM sees both directions, CLM generates. BERT masks tokens and predicts them using both left and right context — powerful for understanding but can't generate text naturally. GPT predicts left-to-right only — weaker for understanding but can generate fluently. This fundamental tradeoff shaped the entire field: BERT-family for classification/NLU, GPT-family for generation.
Exercise 6.1: MLM Prediction Count Derive

BERT processes a sequence of 512 tokens with 15% masking. How many tokens does the model predict per sequence? And of those, how many see [MASK] vs random token vs original?

masked predictions per sequence
Show derivation
Masked = 0.15 × 512 = 76.8 ≈ 77 tokens
Of those 77: [MASK] = 80% × 77 ≈ 62 tokens
Random replacement = 10% × 77 ≈ 8 tokens
Kept original = 10% × 77 ≈ 7 tokens

Only 77 out of 512 positions generate a loss signal — that's 15% efficiency. CLM generates loss at all 512 positions, making it 6.7x more data-efficient per sequence. This is one reason GPT-style models eventually overtook BERT: they extract more learning from each training example.

Exercise 6.2: CLM vs MLM Efficiency Derive

CLM predicts at all T positions. MLM predicts at ~15% of T positions. For a training corpus of 300B tokens, how many prediction events does each objective produce?

billion MLM predictions
Show derivation
CLM predictions = 300B (one per token position)
MLM predictions = 0.15 × 300B = 45B
CLM/MLM ratio = 300/45 = 6.67× more predictions

CLM gets 6.7x more training signal from the same data. This is misleading though — MLM predictions see bidirectional context, which is arguably a richer learning signal per prediction. The field has debated this tradeoff extensively. Chinchilla (Hoffmann et al., 2022) showed that for CLM, optimal training uses roughly 20 tokens per parameter. For a 7B model, that's ~140B tokens.

Exercise 6.3: Why 80/10/10? Trace
BERT masks 15% of tokens, but of those, only 80% become [MASK] while 10% become random tokens and 10% stay the same. Why not use [MASK] for all 15%?
Show explanation

The core issue is pretrain-finetune mismatch. During fine-tuning, the model never sees [MASK] tokens — all inputs are real words. If the model only learned to predict from [MASK] tokens, it would struggle with real inputs. The 10% "keep original" and 10% "random token" ensure the model also learns to predict from non-[MASK] positions, reducing the distribution shift at fine-tuning time. The BERT paper (Devlin et al., 2019) discusses this in Section 3.1.

Exercise 6.4: Find the Bug Debug

This MLM masking function has a bug. The model trains fine but downstream performance is poor. Click the buggy line.

function applyMLMMask(tokens, maskProb) {
  const labels = tokens.slice();
  for (let i = 0; i < tokens.length; i++) {
    if (Math.random() < maskProb) {
      tokens[i] = MASK_TOKEN;  // always [MASK]
    } else {
      labels[i] = -100;  // ignore in loss
    }
  }
  return { tokens, labels };
}
Show explanation

Line 5 is the bug. All selected positions are replaced with [MASK], ignoring the 80/10/10 split. The correct implementation should: 80% of the time use [MASK], 10% use a random token, and 10% keep the original. Without this split, the model overfits to the [MASK] token distribution and suffers from pretrain-finetune mismatch. The fix:

javascript
const r = Math.random();
if (r < 0.8) tokens[i] = MASK_TOKEN;
else if (r < 0.9) tokens[i] = randomToken();
// else: keep original (10%)
Exercise 6.5: CLM Causal Mask Trace
In CLM, a causal attention mask prevents position t from attending to positions t+1, t+2, ... Why is this necessary?
Show explanation

If position t can see position t+1, then when predicting wt+1, the model already has wt+1 in its context — it just copies the answer instead of learning to predict. The causal mask enforces the autoregressive property: each prediction only uses past tokens. This is also what makes CLM naturally suited for generation: at inference time, the model generates one token at a time, always conditioning only on previous tokens, which exactly matches training.

Chapter 7: Fine-tuning & PEFT

You have a pretrained BERT-base (110M parameters) and want to classify movie reviews as positive or negative. Full fine-tuning updates all 110M parameters — effective but expensive. Parameter-Efficient Fine-Tuning (PEFT) methods freeze most parameters and only train a tiny fraction, achieving similar quality at a fraction of the cost.

Full fine-tuning memory (mixed precision with Adam):
Weights: 2 bytes/param (BF16)
Gradients: 2 bytes/param (BF16)
Adam states: 4 + 4 = 8 bytes/param (FP32 momentum + variance)
Master weights: 4 bytes/param (FP32 copy for optimizer)
Total: ~16 bytes/param

LoRA (Low-Rank Adaptation):
Freeze base weights W [d, d]. Add: W + ΔW = W + BA
A: [d, r], B: [r, d]   // r << d
Trainable params per adapted layer: 2 × d × r
LoRA's key insight: weight updates are low-rank. When fine-tuning a large model on a specific task, the actual change to the weight matrices (ΔW) has much lower rank than d. If ΔW is approximately rank-16, then representing it as BA where B:[r,d] and A:[d,r] with r=16 captures the update with 2×d×r parameters instead of d×d. For d=768, that's 24,576 instead of 589,824 — a 24× reduction per layer.

Exercise 7.1: Full Fine-tuning Memory Derive

BERT-base has 110M parameters. Full fine-tuning with mixed-precision Adam requires ~16 bytes/param. How much GPU memory just for weights + optimizer? (Ignore activations.)

GB
Show derivation
Memory = 110 × 106 × 16 bytes = 1,760,000,000 bytes
= 1,760 MB = 1.76 GB

1.76 GB fits easily on any modern GPU. But for a 7B model: 7 × 109 × 16 = 112 GB — that's more than a single A100 80GB. This is why PEFT methods like LoRA are essential for large models. They reduce the 16 bytes/param down to ~2 bytes/param for frozen weights + 16 bytes/param for just the tiny adapter.

Exercise 7.2: LoRA Trainable Parameters Derive

Apply LoRA with rank=16 to Q and V projections of BERT-base (d=768, L=12 layers). Each adapter pair has shapes A:[d,r] + B:[r,d]. How many trainable parameters total?

million trainable params
Show derivation
Per adapter pair: d×r + r×d = 768×16 + 16×768 = 24,576
Per layer (Q + V): 2 × 24,576 = 49,152
All 12 layers: 12 × 49,152 = 589,824
= 0.59M trainable parameters
Ratio to full model: 0.59M / 110M = 0.54%

Only 0.54% of parameters are trainable! The other 99.46% are frozen, requiring only 2 bytes/param (for inference). LoRA memory: frozen weights (110M × 2 = 220 MB) + trainable weights+optimizer (0.59M × 16 = 9.4 MB) = ~230 MB total. Compare to 1.76 GB for full fine-tuning.

Exercise 7.3: Prefix Tuning Derive

Prefix tuning prepends k learnable "virtual tokens" to the key and value sequences at each layer. With k=20 virtual tokens, d=768, L=12 layers. How many trainable parameters?

Each layer gets k virtual key vectors [k, d] and k virtual value vectors [k, d].

million trainable params
Show derivation
Per layer: k × d (keys) + k × d (values) = 2 × 20 × 768 = 30,720
All 12 layers: 12 × 30,720 = 368,640
= 0.37M trainable parameters

Even fewer than LoRA! But prefix tuning has a downside: the k virtual tokens consume positions in the sequence length budget. With k=20 and max sequence 512, you effectively lose 20 positions (3.9%). For long-context tasks, this matters. LoRA doesn't consume any sequence positions.

Exercise 7.4: Compare Approaches Design

Order these fine-tuning methods from most to fewest trainable parameters (for BERT-base, d=768, L=12):

?
>
?
>
?
>
?
Full fine-tuning (110M) LoRA r=16, Q+V (0.59M) Prefix tuning k=20 (0.37M) Linear head only (0.0015M)
Show explanation

Correct order: Full (110M) > LoRA (0.59M) > Prefix (0.37M) > Head only (0.0015M = 768 × 2 for binary classification).

Performance generally follows the same order, but with diminishing returns: LoRA with 0.5% of parameters typically reaches 95-99% of full fine-tuning quality. The linear head alone can work for simple tasks where the pretrained representations already capture what you need.

Exercise 7.5: Implement LoRA Forward Pass Build

Write a function that computes the LoRA-modified output: y = Wx + BAx, where W is the frozen weight, B and A are low-rank adapter matrices.

Use nested loops for matrix-vector multiplication. W[i][j] * x[j] etc.
Show solution
javascript
function loraForward(x, W, A, B) {
  const dOut = W.length, dIn = x.length, r = A.length;
  // Compute Wx
  const Wx = Array(dOut).fill(0);
  for (let i = 0; i < dOut; i++)
    for (let j = 0; j < dIn; j++)
      Wx[i] += W[i][j] * x[j];
  // Compute Ax (r-dim)
  const Ax = Array(r).fill(0);
  for (let i = 0; i < r; i++)
    for (let j = 0; j < dIn; j++)
      Ax[i] += A[i][j] * x[j];
  // Compute BAx (dOut-dim)
  const BAx = Array(dOut).fill(0);
  for (let i = 0; i < dOut; i++)
    for (let j = 0; j < r; j++)
      BAx[i] += B[i][j] * Ax[j];
  // y = Wx + BAx
  return Wx.map((v, i) => v + BAx[i]);
}
Exercise 7.6: LoRA vs Full FT Memory Derive

For a 7B parameter model: compare memory for (A) full fine-tuning (16 bytes/param) vs (B) LoRA r=16 on Q,V with d=4096, L=32 (frozen base at 2 bytes/param + 16 bytes/param for LoRA weights only). What is the memory ratio B/A?

ratio (B/A)
Show derivation
(A) Full FT: 7B × 16 = 112 GB
(B) LoRA trainable: 2 × 2 × 4096 × 16 × 32 = 8,388,608 ≈ 8.4M params
(B) Memory: 7B × 2 (frozen) + 8.4M × 16 (LoRA optimizer) = 14.0 GB + 0.13 GB ≈ 14.1 GB
Ratio = 14.1 / 112 ≈ 0.126

LoRA uses ~12.5% of the memory of full fine-tuning — making it feasible to fine-tune a 7B model on a single 24GB GPU (14 GB for weights + a few GB for activations). Full fine-tuning would need at least 2x A100 80GB just for the optimizer states.

Chapter 8: Evaluation Metrics

Your model generates text that looks reasonable to you, but you need to convince your team it's actually good. NLP evaluation metrics provide automatic, reproducible scores. Each metric captures a different aspect of quality: BLEU measures precision of n-gram overlap (machine translation), ROUGE measures recall of n-gram overlap (summarization), and F1 measures exact token overlap (extractive QA).

BLEU (Bilingual Evaluation Understudy):
pn = (# matching n-grams in hypothesis) / (# total n-grams in hypothesis)
BLEU = BP · exp( ∑n=1N wn log pn )
BP = min(1, exp(1 - r/c))   // brevity penalty

ROUGE-L (Longest Common Subsequence):
LCS = length of longest common subsequence
Precision = LCS / |hypothesis|, Recall = LCS / |reference|
F1 = 2 × P × R / (P + R)

Extractive QA F1:
F1 over tokens: P = |pred ∩ gold| / |pred|, R = |pred ∩ gold| / |gold|
BLEU and ROUGE are opposites. BLEU is precision-based: "of the words the model generated, how many match the reference?" Good for translation where you want the output to be accurate. ROUGE is recall-based: "of the words in the reference, how many does the model capture?" Good for summarization where you want coverage of key information.
Exercise 8.1: BLEU-1 (Unigram Precision) Derive

Reference: "the cat is on the mat". Hypothesis: "the the the the the the". Compute the naive unigram precision (# matching unigrams / # hypothesis unigrams). Then compute the clipped unigram precision (each word's match count is clipped to its count in the reference).

clipped unigram precision
Show derivation

Naive precision: All 6 words in the hypothesis are "the", and "the" appears in the reference. So naive precision = 6/6 = 1.0. This is clearly wrong — the hypothesis is terrible!

Naive: 6 matches / 6 hypothesis words = 1.0 (broken!)

Clipped precision: "the" appears 2 times in the reference, so we clip the match count to 2:

Clipped: min(6, 2) / 6 = 2/6 = 0.333

Clipping prevents gaming by repeating high-frequency words. BLEU always uses clipped counts. This is why BLEU is called "modified n-gram precision" — the modification is this clipping step.

Exercise 8.2: BLEU-2 (Bigram Precision) Derive

Reference: "the cat sat on the mat". Hypothesis: "the cat on the mat". Compute the clipped bigram precision (2-gram precision).

Hypothesis bigrams: (the,cat), (cat,on), (on,the), (the,mat). Reference bigrams: (the,cat), (cat,sat), (sat,on), (on,the), (the,mat).

bigram precision
Show derivation
Hypothesis bigrams: {(the,cat), (cat,on), (on,the), (the,mat)} — 4 bigrams
Reference bigrams: {(the,cat), (cat,sat), (sat,on), (on,the), (the,mat)}
Matches: (the,cat)✓, (cat,on)✗, (on,the)✓, (the,mat)✓
Clipped bigram precision = 3/4 = 0.75

The hypothesis correctly captures 3 of 4 bigrams but misses (cat,sat) because "sat" was omitted. The missing word "sat" breaks one bigram pair. Higher-order n-grams are more sensitive to word order and fluency — BLEU-4 (the standard) uses up to 4-grams.

Exercise 8.3: ROUGE-L (LCS) Derive

Reference: "police killed the gunman". Hypothesis: "police kill the gunman". What is the length of the longest common subsequence (LCS)? Then compute ROUGE-L F1.

LCS does NOT require consecutive tokens — subsequence allows gaps.

ROUGE-L F1
Show derivation
Reference: [police, killed, the, gunman] (4 words)
Hypothesis: [police, kill, the, gunman] (4 words)
LCS: "police", "the", "gunman" = 3 words
("killed" and "kill" don't match as exact tokens)
Precision = LCS / |hyp| = 3/4 = 0.75
Recall = LCS / |ref| = 3/4 = 0.75
F1 = 2 × 0.75 × 0.75 / (0.75 + 0.75) = 0.75

ROUGE-L F1 = 0.75. The score is penalized because "killed" vs "kill" is an exact token mismatch. Some ROUGE variants use stemming to handle this (both stem to "kill"), which would give LCS=4 and perfect F1=1.0. This shows why metric choice matters — stemming, tokenization, and case sensitivity all affect the final score.

Exercise 8.4: Extractive QA F1 Derive

Question: "When was the Eiffel Tower built?" Gold answer: "1887 to 1889". Predicted answer: "built in 1889". Compute the token-level F1 score.

F1 score
Show derivation
Gold tokens: {1887, to, 1889} — 3 tokens
Pred tokens: {built, in, 1889} — 3 tokens
Overlap: {1889} — 1 token
Precision = 1/3 = 0.333
Recall = 1/3 = 0.333
F1 = 2 × 0.333 × 0.333 / (0.333 + 0.333) = 0.333

F1 = 0.33, which seems harsh — the model got the year right but also extracted extra tokens ("built in"). In SQuAD evaluation, exact match (EM) would be 0 (not an exact match) but F1 captures partial credit. A predicted answer of just "1889" would get Precision=1.0, Recall=1/3, F1=0.5 — still penalized for missing "1887 to." The ideal answer is the exact span "1887 to 1889."

Exercise 8.5: Implement BLEU-1 Build

Write a function that computes clipped unigram precision (BLEU-1 without brevity penalty).

Count each word's occurrences in both, clip hypothesis count to reference count.
Show solution
javascript
function bleu1(hypothesis, reference) {
  // Count words in reference
  const refCount = {};
  for (const w of reference)
    refCount[w] = (refCount[w] || 0) + 1;
  // Count clipped matches
  const hypCount = {};
  let clippedMatches = 0;
  for (const w of hypothesis) {
    hypCount[w] = (hypCount[w] || 0) + 1;
  }
  for (const w in hypCount) {
    clippedMatches += Math.min(hypCount[w], refCount[w] || 0);
  }
  return hypothesis.length > 0 ? clippedMatches / hypothesis.length : 0;
}
Exercise 8.6: Brevity Penalty Derive

The BLEU brevity penalty is BP = min(1, exp(1 - r/c)) where r = reference length, c = hypothesis length. If the reference has 20 words and the hypothesis has 12 words, what is BP?

brevity penalty
Show derivation
BP = min(1, exp(1 - r/c)) = min(1, exp(1 - 20/12))
= min(1, exp(1 - 1.667)) = min(1, exp(-0.667))
= min(1, 0.513) = 0.513

The brevity penalty halves the BLEU score! This prevents a strategy of generating short, confident outputs. Without BP, a model could output a single highly-probable word and achieve perfect precision. BP kicks in whenever the hypothesis is shorter than the reference (c < r). When c ≥ r, BP = 1 (no penalty).

Chapter 9: Capstone: Build a Text Classifier

You've been hired to build a sentiment classifier for product reviews. You have 10,000 labeled training examples (positive/negative), a budget of $200 for GPU time, and a deadline of 3 days. Time to make engineering decisions — not every choice involves a novel algorithm, but every choice has cost, quality, and time implications.

This chapter ties everything together. You'll make decisions about tokenization, model selection, training configuration, and cost estimation. There's no single right answer — but there are wrong answers that waste your budget or miss your deadline.
Exercise 9.1: Tokenizer Choice Trace
Your reviews are in English with occasional emoji and product model numbers (e.g., "iPhone 15 Pro Max"). Which tokenization approach should you use?
Show reasoning

Always use the tokenizer that matches your pretrained model. BERT's WordPiece and RoBERTa's BPE handle OOV words, model numbers, and most emoji. Training a custom tokenizer on 10K examples would be vastly inferior to the pretrained tokenizer trained on billions of words. Character-level would work but makes sequences 4-5x longer, increasing compute. Word-level would choke on model numbers and rare product names.

Exercise 9.2: Model Selection Trace
For 10K training examples and a binary classification task (sentiment), which approach makes the most sense?
Show reasoning

With 10K examples, BERT-base or RoBERTa-base is the sweet spot. Training from scratch requires millions of examples. A 70B model with LoRA would work but is overkill — the marginal improvement over BERT-base on binary sentiment is minimal, while compute cost is 10-50x higher. Bag-of-words actually works reasonably (85-88%) but BERT easily reaches 93-95% with fine-tuning, and the compute cost is minimal ($1-5 on a single GPU for a few hours).

Exercise 9.3: Fine-tuning FLOPs Derive

Fine-tuning BERT-base (110M params) on 10K examples, sequence length 128, for 3 epochs. Approximate total FLOPs using the rule: FLOPs ≈ 6 × N × D where N = params and D = total tokens processed.

TFLOPs (1012)
Show derivation
D = 10,000 examples × 128 tokens × 3 epochs = 3,840,000 tokens
FLOPs = 6 × 110 × 106 × 3,840,000
= 6 × 110 × 106 × 3.84 × 106
= 2,534 × 1012 ≈ 2,534 TFLOPs

2,534 TFLOPs. An A100 delivers ~312 TFLOPS (BF16 with tensor cores). At 50% utilization: 2,534 / (312 × 0.5) = ~16 seconds of A100 time. This is why BERT fine-tuning is so cheap — even on consumer GPUs (RTX 3090 at 71 TFLOPS), it takes under 2 minutes. The $200 budget is wildly more than needed.

Exercise 9.4: Memory Budget Derive

BERT-base (110M params) with full fine-tuning (16 bytes/param for optimizer). Activation memory for batch_size=32, seq_len=128, d=768, L=12 layers is approximately 32 × 128 × 768 × 12 × 2 bytes. What is the total GPU memory needed (weights+optimizer+activations)?

GB
Show derivation
Weights + optimizer = 110M × 16 = 1,760 MB = 1.76 GB
Activations ≈ 32 × 128 × 768 × 12 × 2 = 75,497,472 bytes ≈ 75 MB
Total ≈ 1.76 GB + 0.075 GB + ~0.7 GB overhead ≈ 2.53 GB

~2.5 GB fits on any GPU made in the last 5 years (even a 4GB GTX 1050 would work with smaller batch size). This is why BERT fine-tuning democratized NLP — you don't need expensive hardware. The real cost is not compute but data labeling: 10,000 labeled examples at $0.10/label = $1,000, far more than the $5 of GPU time.

Exercise 9.5: Design the Pipeline Design

Put these fine-tuning steps in the correct order:

?
?
?
?
?
Tokenize with pretrained tokenizer Load pretrained BERT + add classification head Fine-tune (3 epochs, lr=2e-5, AdamW) Evaluate on held-out test set Split data: 80% train, 10% val, 10% test
Show explanation

Correct order: Split dataTokenizeLoad pretrained modelFine-tuneEvaluate

You must split before tokenizing to prevent data leakage. The model is loaded after tokenization because you need the tokenizer to prepare inputs. Fine-tuning uses the validation set for early stopping. Final evaluation on the test set happens only once — if you tune hyperparameters based on test performance, you're overfitting to the test set.

Exercise 9.6: Cost Estimation Derive

An A100 on AWS costs ~$3.00/hour (on-demand). Your fine-tuning takes ~16 seconds (from Exercise 9.3). Including hyperparameter search (5 runs) and evaluation, estimate total GPU hours and cost.

dollars (USD)
Show derivation
5 training runs × 16 seconds = 80 seconds
Evaluation overhead ≈ 20 seconds
Data loading/preprocessing ≈ 60 seconds
Total GPU time ≈ 160 seconds ≈ 0.044 hours
Cost = 0.044 × $3.00 = $0.13

Under $1 total! With the $200 budget, you could run 1,500 experiments. In practice, you'd spend more on the minimum billing increment (most cloud providers charge per-minute or per-second with a minimum). The real lesson: BERT-base fine-tuning on small datasets is essentially free. The engineering time to set up the pipeline (a few hours of your salary) vastly exceeds the compute cost.

Exercise 9.7: Find the Bug Debug

Your colleague's training code produces suspiciously high accuracy (99.8%) on the test set. Click the buggy line.

data = load_reviews('reviews.csv')
tokenized = tokenize(data)
model = BertForSequenceClassification.from_pretrained('bert-base')
model.fit(tokenized, epochs=3, lr=2e-5)
accuracy = model.evaluate(tokenized)  # eval on training data!
print(f'Test accuracy: {accuracy}')
Show explanation

Line 5 is the bug. The code evaluates on the same tokenized data used for training — there's no train/test split! The model memorized the training examples and the 99.8% accuracy is meaningless. The fix: split data into train/val/test before training, train on train, use val for early stopping, and evaluate on test only once at the end. Without this split, you have no idea how the model performs on unseen data.