NLP Fundamentals (CS224N) Workbook

Chapter 0: Word Vectors

How do you tell a computer that "king" and "queen" are related but "king" and "toaster" are not? You can't just alphabetize — "cat" and "car" are close alphabetically but worlds apart semantically. The insight behind word vectors is to represent each word as a point in a high-dimensional space where distance equals meaning.

In Word2Vec (skipgram), you train a model to predict context words from a center word. The key idea: if two words appear in similar contexts, they end up with similar vectors. The training window defines "context" — for window size w, context words are the w words before and after the center word.

Skipgram context extraction:
Given sentence: w₁ w₂ ... w_n
Center word: w_c
Context (window=m): { w_c-m, ..., w_c-1, w_c+1, ..., w_c+m }

Similarity:
Dot product: u · v = ∑_i u_iv_i
Cosine similarity: cos(θ) = (u · v) / (||u|| · ||v||)

The distributional hypothesis. "You shall know a word by the company it keeps" (Firth, 1957). Word2Vec turns this linguistic insight into geometry: words that share contexts become neighbors in vector space. This is why king - man + woman ≈ queen works — the "royalty" direction is orthogonal to the "gender" direction.

Exercise 0.1: Context Words Trace

Sentence: "the cat sat on the mat". Center word: "sat" (position 3, 1-indexed). Window size = 2. Which words are the context words?

{the, cat, on, the, mat} {cat, on} — only 1 word each side {the, cat, on, the} — 2 words each side {sat, on, the} — center word included

Show explanation

With window=2, we take 2 words to the left and 2 words to the right of the center word "sat" (position 3).

Left context: positions 1,2 → "the", "cat"

Right context: positions 4,5 → "on", "the"

Context = {the, cat, on, the}

Note: the center word is never included in its own context. Also note "the" appears twice — once from the left, once from the right. In practice Word2Vec treats these as separate training pairs: (sat, the), (sat, cat), (sat, on), (sat, the).

Exercise 0.2: Dot Product Similarity Derive

Given vectors: v_cat = [0.5, 0.8, -0.2], v_dog = [0.4, 0.9, -0.1]. Compute the dot product v_cat · v_dog.

(scalar)

Show derivation

v_cat · v_dog = (0.5)(0.4) + (0.8)(0.9) + (-0.2)(-0.1)

= 0.20 + 0.72 + 0.02 = 0.94

A positive dot product means the vectors point in roughly the same direction — the words are semantically similar. Compare this to v_cat · v_toaster which would be near zero for unrelated words.

Exercise 0.3: Cosine Similarity Derive

Using the same vectors: v_cat = [0.5, 0.8, -0.2], v_dog = [0.4, 0.9, -0.1]. Compute the cosine similarity (round to 2 decimal places).

Compute ||v_cat|| = √(0.25 + 0.64 + 0.04), ||v_dog|| = √(0.16 + 0.81 + 0.01), then divide dot product by the product of norms.

(between -1 and 1)

Show derivation

||v_cat|| = √(0.25 + 0.64 + 0.04) = √0.93 ≈ 0.9644

||v_dog|| = √(0.16 + 0.81 + 0.01) = √0.98 ≈ 0.9899

cos(θ) = 0.94 / (0.9644 × 0.9899) = 0.94 / 0.9547 ≈ 0.984

Cosine similarity of 0.98 — extremely similar! Cosine similarity normalizes for vector length, so it measures direction only. Two vectors with cosine similarity = 1.0 point the same way regardless of magnitude. This is why NLP prefers cosine over raw dot product for comparing embeddings.

Exercise 0.4: Implement cosineSimilarity() Build

Write a function that takes two arrays (vectors) of equal length and returns their cosine similarity.

Return a single number between -1 and 1.

Show solution

javascript
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

Exercise 0.5: Analogy Arithmetic Derive

Given: v_king = [0.8, 0.6, 0.9], v_man = [0.3, 0.1, 0.7], v_woman = [0.3, 0.5, 0.7]. Compute v_result = v_king - v_man + v_woman. What is the first component of v_result?

(first component)

Show derivation

v_result = [0.8, 0.6, 0.9] - [0.3, 0.1, 0.7] + [0.3, 0.5, 0.7]

= [0.8-0.3+0.3, 0.6-0.1+0.5, 0.9-0.7+0.7]

= [0.8, 1.0, 0.9]

The result [0.8, 1.0, 0.9] should be closest to v_queen in the embedding space. Notice the first component stayed at 0.8 (the "royalty" axis), while the second component jumped from 0.6 to 1.0 (the "femininity" axis). This is the geometric magic of word vector arithmetic: king - man + woman ≈ queen.

Exercise 0.6: Training Pairs Derive

Sentence: "I love deep learning models". Window size = 2. How many total (center, context) training pairs does Word2Vec skipgram generate?

Count each (center, context) pair. Edge words have fewer context words since the window gets clipped at sentence boundaries.

training pairs

Show derivation

For each word, count how many context words are within the window:

"I" (pos 1): right only = {love, deep} → 2 pairs

"love" (pos 2): left = {I}, right = {deep, learning} → 3 pairs

"deep" (pos 3): left = {I, love}, right = {learning, models} → 4 pairs

"learning" (pos 4): left = {love, deep}, right = {models} → 3 pairs

"models" (pos 5): left only = {deep, learning} → 2 pairs

Total = 2 + 3 + 4 + 3 + 2 = 14 pairs

"love" gets only 3 context words (not 4) because the window clips at the left boundary — there's only 1 word to its left. Edge effects reduce total pairs. The general formula: for a sentence of length n with window m, each word at position i produces min(m, i-1) + min(m, n-i) pairs.

Chapter 1: Word2Vec Gradients

Word2Vec learns by gradient descent, so we need to understand the objective function and its gradients. The skipgram model defines the probability of observing a context word o given center word c using the softmax over the entire vocabulary:

Skipgram probability (softmax):
P(o | c) = exp(u_o · v_c) / ∑_w∈V exp(u_w · v_c)

Loss for one (center, context) pair:
J = -log P(o | c) = -(u_o · v_c) + log ∑_w∈V exp(u_w · v_c)

Gradient w.r.t. v_c:
∂J / ∂v_c = -u_o + ∑_w∈V P(w|c) · u_w
// observed context vector minus expected context vector

The gradient has a beautiful interpretation. ∂J/∂v_c = -(u_o - E[u_w]). The gradient pushes v_c toward the observed context word u_o and away from the expected (probability-weighted average) context word. Training = making the center vector more like its actual neighbors.

Exercise 1.1: Softmax Probability Derive

Vocabulary: {a, b, c, d, e}. Center word "c" with v_c = [1, 0]. Context vectors: u_a=[1,1], u_b=[0,1], u_c=[1,0], u_d=[-1,0], u_e=[0,-1]. Compute P(a | c).

First compute all dot products u_w · v_c, then apply softmax.

(probability)

Show derivation

u_a · v_c = [1,1] · [1,0] = 1

u_b · v_c = [0,1] · [1,0] = 0

u_c · v_c = [1,0] · [1,0] = 1

u_d · v_c = [-1,0] · [1,0] = -1

u_e · v_c = [0,-1] · [1,0] = 0

exp values: e¹=2.718, e⁰=1, e¹=2.718, e^-1=0.368, e⁰=1

∑ = 2.718 + 1 + 2.718 + 0.368 + 1 = 7.804

P(a|c) = 2.718 / 7.804 ≈ 0.348

Words with higher dot product to the center word get higher probability. u_a and u_c both have dot product 1, so they share the highest probability. u_d has dot product -1, so it gets the lowest probability (0.047). This is exactly what we want: the model says "a" and "c" are likely context words for center word "c".

Exercise 1.2: The Softmax Bottleneck Trace

With a vocabulary of V = 50,000 words, computing the exact softmax for one training pair requires how many dot products?

1 dot product (just the context word) The window size (e.g. 5-10) 50,000 — one for every word in the vocabulary (the denominator sum) 50,000² — all pairs

Show explanation

The denominator ∑_w∈V exp(u_w · v_c) requires computing the dot product of v_c with every context vector u_w in the vocabulary. For V = 50,000, that's 50,000 dot products per training pair. With billions of training pairs, this is prohibitively expensive. This is the softmax bottleneck — and it's exactly why negative sampling was invented.

Exercise 1.3: Negative Sampling Loss Derive

Negative sampling replaces the full softmax with a binary classification. Instead of normalizing over V words, we maximize log σ(u_o · v_c) + ∑_k log σ(-u_k · v_c) for k negative samples. Given: u_o · v_c = 2.0, and 2 negative samples with u_k1 · v_c = -0.5, u_k2 · v_c = 0.3. Compute the negative sampling loss J_NS = -[log σ(2.0) + log σ(0.5) + log σ(-0.3)].

σ(x) = 1/(1+e^-x). So σ(2.0)≈0.881, σ(0.5)≈0.622, σ(-0.3)≈0.426.

(loss value)

Show derivation

Positive term: σ(u_o · v_c) = σ(2.0). Negative terms use σ(-u_k · v_c):

σ(2.0) = 1/(1+e^-2) = 1/1.135 ≈ 0.881

σ(-(-0.5)) = σ(0.5) = 1/(1+e^-0.5) ≈ 0.622

σ(-(0.3)) = σ(-0.3) = 1/(1+e^0.3) ≈ 0.426

J_NS = -(ln 0.881 + ln 0.622 + ln 0.426)

= -(-0.127 - 0.476 - 0.853) = 1.456

The key subtlety: negative samples use σ(-u_k · v_c), which encourages high sigmoid output when the dot product is negative (i.e., the words are dissimilar). With just k=2 negatives instead of V=50,000 terms in the softmax denominator, training is orders of magnitude faster.

Exercise 1.4: Gradient Computation Derive

Using the 5-word vocabulary from Exercise 1.1, with the observed context word being "a": compute the gradient ∂J/∂v_c = -u_o + ∑_w P(w|c) · u_w. What is the first component of the gradient vector?

Use the probabilities: P(a|c)=0.348, P(b|c)=0.128, P(c|c)=0.348, P(d|c)=0.047, P(e|c)=0.128. The expected vector E[u_w] = ∑ P(w|c) · u_w.

(first component of gradient)

Show derivation

First compute E[u_w] (first component only):

E[u_w]₁ = 0.348(1) + 0.128(0) + 0.348(1) + 0.047(-1) + 0.128(0)

= 0.348 + 0 + 0.348 - 0.047 + 0 = 0.649

∂J/∂v_c (first component) = -u_a,1 + E[u_w]₁ = -1 + 0.649 = -0.351

The gradient is negative in the first component, meaning the update will increase v_c in that direction — pushing v_c toward the observed context word u_a. The gradient literally says: "move the center vector closer to the word you actually observed."

Exercise 1.5: Negative Sampling Efficiency Trace

Negative sampling with k=15 on a vocabulary of V=100,000. By what factor does negative sampling reduce computation per training step compared to full softmax?

~10x faster ~100x faster ~6,250x faster — we compute 16 terms instead of 100,000 Same speed, just a different loss function

Show explanation

Full softmax: V = 100,000 dot products per step

Negative sampling: 1 positive + 15 negatives = 16 dot products per step

Speedup = 100,000 / 16 = 6,250×

This is why negative sampling made Word2Vec practical. The original paper (Mikolov et al., 2013) reported that k=5-20 negative samples works well for small datasets, while k=2-5 suffices for very large corpora. The quality barely degrades because most words in the vocabulary are irrelevant to any given center word — you only need a small sample of negatives to approximate the full gradient.

Exercise 1.6: Find the Bug Debug

This negative sampling loss implementation has a subtle bug. Click the buggy line.

function negSamplingLoss(vc, uo, negVecs) {
  let loss = -Math.log(sigmoid(dot(uo, vc)));
  for (let k = 0; k < negVecs.length; k++) {
    loss -= Math.log(sigmoid(dot(negVecs[k], vc)));
  }
  return loss;
}

Show explanation

Line 4 is the bug. For negative samples, we want σ(-u_k · v_c), not σ(u_k · v_c). The correct line should be: loss -= Math.log(sigmoid(-dot(negVecs[k], vc)));

Without the negation, the loss encourages high similarity with negative samples instead of discouraging it. The model would learn to make all words similar — the exact opposite of what we want.

Chapter 3: Language Model Perplexity

You've trained two language models and both look reasonable. How do you decide which one is better? You can't just eyeball their generated text — you need a number. Perplexity is that number. It measures how "surprised" the model is by real text. A model that assigns high probability to actual sentences has low perplexity.

Perplexity:
PP(W) = exp( -1/N ∑_i=1^N log P(w_i | w₁...w_i-1) )

Equivalently:
PP(W) = ( ∏_i=1^N P(w_i | w₁...w_i-1) )^-1/N

Interpretation:
PP = weighted average branching factor
PP = 10 means the model is as uncertain as choosing among 10 equally likely words

Perplexity is the exponential of cross-entropy loss. When you train a language model with cross-entropy loss and get a validation loss of 3.5, the perplexity is exp(3.5) = 33.1. This means the model is, on average, as confused as if it were picking uniformly from ~33 words at each position. Modern LLMs achieve perplexities of 5-15 on typical benchmarks.

Exercise 3.1: Compute Perplexity Derive

A language model assigns these probabilities to a 5-word sentence: P(w₁)=0.1, P(w₂|w₁)=0.3, P(w₃|w_1:2)=0.5, P(w₄|w_1:3)=0.2, P(w₅|w_1:4)=0.4. Compute the perplexity.

perplexity

Show derivation

-1/N ∑ log P = -1/5 [log(0.1) + log(0.3) + log(0.5) + log(0.2) + log(0.4)]

= -1/5 [-2.303 + (-1.204) + (-0.693) + (-1.609) + (-0.916)]

= -1/5 × (-6.725) = 1.345

PP = exp(1.345) ≈ 3.84

Alternative computation via the product formula: (0.1 × 0.3 × 0.5 × 0.2 × 0.4)^-1/5 = (0.0012)^-0.2 = (1/0.0012)^0.2 = 833.3^0.2 ≈ 3.84. Both methods give the same answer. The model is about as confused as picking from ~4 equally likely options.

Exercise 3.2: Cross-Entropy to Perplexity Derive

Your language model reports a validation cross-entropy loss of 2.8 (nats, i.e. natural log base). What is the perplexity?

perplexity

Show derivation

PP = exp(cross-entropy) = exp(2.8) = e^2.8 ≈ 16.44

This is the most common conversion you'll do in practice. Training loss of 2.8 → perplexity of ~16.4. For reference: GPT-2 achieved ~29.4 perplexity on WikiText-103, GPT-3 improved to ~20.5, and current frontier models are in the 5-10 range on similar benchmarks.

Exercise 3.3: Which Model is Better? Trace

Model A has perplexity 22.1 on the test set. Model B has perplexity 15.3. Which model is better, and what does the difference mean concretely?

Model A — higher perplexity means more confident Model B — it's choosing among ~15 options vs ~22, so it's less confused by real text Can't tell — perplexity depends on vocabulary size

Show explanation

Lower perplexity = better model. Model B (PP=15.3) assigns higher probability to the test text than Model A (PP=22.1). Concretely: at each word position, Model B is as uncertain as choosing from ~15 options, while Model A is as uncertain as choosing from ~22 options. Model B has a tighter, more accurate distribution over next words.

Caveat: perplexity must be compared on the same test set with the same vocabulary. A model with vocabulary size 1000 will naturally have lower perplexity than one with vocabulary 50000 on the same text, simply because it has fewer choices.

Exercise 3.4: Implement perplexity() Build

Write a function that takes an array of token probabilities and returns the perplexity.

Use Math.log (natural log) and Math.exp.

Show solution

javascript
function perplexity(probs) {
  const N = probs.length;
  let sumLogP = 0;
  for (let i = 0; i < N; i++) {
    sumLogP += Math.log(probs[i]);
  }
  return Math.exp(-sumLogP / N);
}

Exercise 3.5: Perplexity Lower Bound Derive

A uniform language model assigns probability 1/V to every word regardless of context, where V is the vocabulary size. If V = 50,000, what is the perplexity of this model?

perplexity

Show derivation

PP = exp(-1/N ∑ log(1/V)) = exp(-1/N × N × log(1/V))

= exp(-log(1/V)) = exp(log(V)) = V = 50,000

The uniform model has perplexity exactly equal to the vocabulary size. This is the worst-case baseline — any model that learns anything about language should beat this. For V=50,000, the model is maximally confused, picking randomly from all 50,000 words at each position. A perplexity of 15 means the model has reduced its effective choices from 50,000 to 15 — a 3,333× improvement.

Chapter 4: Attention Mechanisms

Before transformers, seq-to-seq models compressed the entire input sentence into a single fixed-length vector. For long sentences, this bottleneck destroyed information. Attention (Bahdanau et al., 2014) lets the decoder look back at all encoder states and focus on the most relevant ones for each output word.

Bahdanau (additive) attention:
e_ij = v^T tanh(W₁s_i + W₂h_j)  // alignment score
α_ij = softmax(e_ij) = exp(e_ij) / ∑_k exp(e_ik)  // attention weight
c_i = ∑_j α_ij h_j  // context vector

Luong (multiplicative) attention:
e_ij = s_i^T W h_j  // or just s_i^Th_j (dot product)

Attention = soft dictionary lookup. Think of the encoder states as key-value pairs and the decoder state as a query. Attention computes a relevance score between the query and each key, normalizes with softmax, and returns a weighted sum of values. This is exactly the Q, K, V framework that transformers generalize.

Exercise 4.1: Alignment Scores Derive

Three encoder states (source words): h₁=[1,0], h₂=[0,1], h₃=[1,1]. One decoder state: s=[0.5, 0.8]. Using dot-product attention (Luong, no W matrix), compute the alignment score e₁ = s · h₁.

(score)

Show derivation

e₁ = s · h₁ = [0.5, 0.8] · [1, 0] = 0.5 × 1 + 0.8 × 0 = 0.5

Exercise 4.2: All Scores + Softmax Derive

Using the same vectors, compute all three alignment scores and then the attention weights α₁, α₂, α₃ via softmax. What is α₃ (the weight on h₃=[1,1])?

(attention weight)

Show derivation

e₁ = [0.5,0.8] · [1,0] = 0.5

e₂ = [0.5,0.8] · [0,1] = 0.8

e₃ = [0.5,0.8] · [1,1] = 1.3

exp(0.5) = 1.649, exp(0.8) = 2.226, exp(1.3) = 3.669

∑ = 1.649 + 2.226 + 3.669 = 7.544

α₁ = 1.649/7.544 = 0.219

α₂ = 2.226/7.544 = 0.295

α₃ = 3.669/7.544 = 0.486

h₃ gets the highest attention weight (~49%) because it has the largest dot product with the decoder state. This makes sense: h₃=[1,1] is the most aligned with s=[0.5,0.8] — it's the source word the decoder "wants" to look at most.

Exercise 4.3: Context Vector Derive

Using the attention weights from 4.2 (α₁≈0.219, α₂≈0.295, α₃≈0.486), compute the context vector c = ∑ α_jh_j. What is the first component of c?

(first component)

Show derivation

c₁ = 0.219 × 1 + 0.295 × 0 + 0.486 × 1 = 0.705

c₂ = 0.219 × 0 + 0.295 × 1 + 0.486 × 1 = 0.781

c = [0.705, 0.781]

The context vector is a weighted average of the encoder states. Since h₃ has the highest weight, c is pulled strongly toward h₃=[1,1]. This vector summarizes "what the decoder should know about the input" for generating the current output word.

Exercise 4.4: Bahdanau vs Luong Trace

Bahdanau attention uses an additive score e = v^Ttanh(W₁s + W₂h) with learnable W₁, W₂, v. Luong attention uses e = s^TWh (or just s^Th for dot). Which statement is correct?

Bahdanau is always more accurate because it has more parameters Luong is always better because dot products are faster Bahdanau can model complex (non-linear) query-key interactions; Luong is faster and simpler. Transformers use scaled dot-product (Luong-style) because it parallelizes via matrix multiplication. They're mathematically equivalent

Show explanation

Bahdanau uses a tanh nonlinearity, so it can capture non-linear relationships between query and key. But this prevents efficient batching — you can't compute all scores with a single matrix multiply. Luong's dot product can be computed as S × H^T, a single GEMM call that GPUs love. This is why transformers chose scaled dot-product attention: QK^T/√d_k is just Luong attention with a scaling factor to prevent softmax saturation.

Exercise 4.5: Which Source Word Gets Attention? Trace

You're translating "Le chat noir" (The black cat) to English. The decoder is generating the word "cat". Encoder states correspond to "Le" (the), "chat" (cat), "noir" (black). Which encoder state should get the highest attention weight?

"Le" — it's the first word, so it should be attended to first "chat" — it's the direct translation of "cat" "noir" — adjectives in French come after the noun All three equally — they all contribute to the translation

Show explanation

"chat" is the French word for "cat," so it should receive the highest attention weight when generating "cat." This is exactly what attention was designed for: it learns to align source and target words. In French-to-English translation, attention handles word reordering naturally — "Le chat noir" becomes "The black cat" by attending to the right word at each step, regardless of position differences between the languages.

Exercise 4.6: Scaled Dot-Product Derive

Transformers use scaled dot-product: score = q · k / √d_k. If d_k = 64 and q · k = 16, what is the scaled score?

(scaled score)

Show derivation

score = q · k / √d_k = 16 / √64 = 16 / 8 = 2.0

Without scaling, dot products grow proportionally to d_k (expected magnitude ≈ √d_k for random vectors). Large dot products push softmax into saturated regions where gradients vanish. Dividing by √d_k keeps the variance at 1, regardless of dimension. The original "Attention Is All You Need" paper (Vaswani et al., 2017) explains this in Section 3.2.1.

Chapter 5: Subword Tokenization

Word-level tokenization has a fatal flaw: what happens when the model encounters a word it's never seen? Out-of-vocabulary (OOV) words are replaced with a generic [UNK] token, destroying meaning. "Unforgettable" becomes [UNK]. The solution is subword tokenization: break words into smaller pieces that cover all possible words.

Byte-Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent pair. After training, the vocabulary contains a mix of characters, subwords ("un", "##ing"), and full words ("the").

BPE Algorithm:
1. Start with character vocabulary + end-of-word symbol
2. Count all adjacent character pairs in corpus
3. Merge the most frequent pair into a new token
4. Repeat step 2-3 for k merge operations

Vocabulary size after k merges:
|V| = |initial chars| + k

Subword tokenization is a spectrum. Character-level: vocabulary of ~256 but sequences are very long. Word-level: short sequences but huge vocabulary and OOV problems. Subword: a sweet spot. GPT-2 uses ~50K BPE tokens, LLaMA uses ~32K SentencePiece tokens. The vocabulary size directly affects the embedding table size (V × d parameters) and the softmax computation over V classes.

Exercise 5.1: BPE Merge Step 1 Trace

Initial vocabulary: {a, b, c, d, _} (where _ is space/end-of-word). Corpus token frequencies: "a b _" appears 5 times, "a b c _" appears 3 times, "d b c _" appears 2 times. The pair counts are: (a,b)=8, (b,_)=5, (b,c)=5, (c,_)=5, (d,b)=2. Which pair is merged first?

(a, b) → "ab" — it has the highest frequency (8) (b, c) — it appears in both "abc" and "dbc" (c, _) — end-of-word pairs should be merged first (d, b) — rarer pairs give better compression

Show explanation

BPE always merges the most frequent pair. (a,b) appears 8 times — more than any other pair. After merging, the new token "ab" replaces all occurrences of "a b" in the corpus. The vocabulary becomes {a, b, c, d, _, ab} with 6 tokens.

Exercise 5.2: Vocabulary Size Derive

You start with a character vocabulary of 256 (all bytes). After performing 32,000 merge operations, what is the final vocabulary size?

tokens

Show derivation

|V| = |initial chars| + k = 256 + 32,000 = 32,256

Each merge creates exactly one new token (the concatenation of the merged pair) without removing the original characters. So the vocabulary grows by exactly 1 per merge. This is how you choose your target vocabulary size: set k = desired_V - initial_chars. LLaMA's 32K vocabulary used approximately 32,000 - 256 = 31,744 merges.

Exercise 5.3: BPE Merge Trace Derive

Corpus: "low lower lowest" with word frequencies: "low"=5, "lower"=2, "lowest"=1. Initial tokens (with end-of-word ¤): l o w ¤ (freq 8), e r ¤ (freq 2), e s t ¤ (freq 1). After 3 merge operations, how many tokens are in the vocabulary? Start from initial character set: {l, o, w, e, r, s, t, ¤} = 8 tokens.

tokens

Show derivation

Count pair frequencies across all words (weighted by word frequency):

(l,o): 5+2+1 = 8, (o,w): 5+2+1 = 8, (w,¤): 5, (w,e): 2+1 = 3

(e,r): 2, (r,¤): 2, (e,s): 1, (s,t): 1, (t,¤): 1

Merge 1: (l,o) → "lo" — freq 8 (tied with (o,w), pick first)

Merge 2: (lo,w) → "low" — now the most frequent pair

Merge 3: (low,¤) → "low¤" — freq 5

Vocabulary = {l, o, w, e, r, s, t, ¤, lo, low, low¤} = 8 + 3 = 11

Exercise 5.4: Why Subword > Word-Level? Trace

A word-level model with V=100,000 words encounters the word "ChatGPT-4o" which is not in its vocabulary. A BPE model with V=32,000 tokens has never seen this exact word either. What happens in each case?

Both map it to [UNK] — neither has seen the word Word-level: [UNK] (meaning destroyed). BPE: ["Chat", "G", "PT", "-", "4", "o"] — the model can still reason about the pieces. Word-level handles it better because it has a larger vocabulary BPE maps it to single characters — no better than [UNK]

Show explanation

BPE's key advantage is that it cannot produce [UNK]. Any string can be decomposed into known subwords — in the worst case, into individual bytes. The model can still compose meaning from pieces: "Chat" (conversation), "GPT" (common in AI context), "4" (number), "o" (suffix). Word-level models are helpless with OOV words, making them brittle against new terminology, typos, and morphological variation.

Exercise 5.5: Embedding Table Impact Derive

Compare the embedding table size for: (A) word-level with V=100,000, d=768 vs (B) BPE with V=32,000, d=768. How many fewer parameters does BPE use (in millions)?

million fewer params

Show derivation

Word-level: 100,000 × 768 = 76,800,000 = 76.8M

BPE: 32,000 × 768 = 24,576,000 = 24.6M

Difference = 76.8M - 24.6M = 52.2M fewer parameters

52M parameters saved just in the embedding table! For GPT-2 Small (124M total), this is ~42% of the entire model. Smaller vocabulary also speeds up the output softmax layer. The tradeoff: BPE sequences are ~20-30% longer than word-level (e.g., "unforgettable" = 1 word-token vs 3 BPE tokens), increasing compute per sentence.

Chapter 6: Pretraining Objectives

The magic of modern NLP is pretraining: train a huge model on a vast corpus with a self-supervised objective, then fine-tune on your specific task. The two dominant objectives are Masked Language Modeling (MLM) used by BERT and Causal Language Modeling (CLM) used by GPT.

MLM (BERT):
Mask 15% of tokens randomly. Of masked tokens:
80% → replace with [MASK]
10% → replace with random token
10% → keep original
Predict original token at masked positions.

CLM (GPT):
Given tokens w₁...w_t-1, predict w_t.
Loss = -∑_t=1^T log P(w_t | w₁...w_t-1)
// Every position is a prediction target

MLM sees both directions, CLM generates. BERT masks tokens and predicts them using both left and right context — powerful for understanding but can't generate text naturally. GPT predicts left-to-right only — weaker for understanding but can generate fluently. This fundamental tradeoff shaped the entire field: BERT-family for classification/NLU, GPT-family for generation.

Exercise 6.1: MLM Prediction Count Derive

BERT processes a sequence of 512 tokens with 15% masking. How many tokens does the model predict per sequence? And of those, how many see [MASK] vs random token vs original?

masked predictions per sequence

Show derivation

Masked = 0.15 × 512 = 76.8 ≈ 77 tokens

Of those 77: [MASK] = 80% × 77 ≈ 62 tokens

Random replacement = 10% × 77 ≈ 8 tokens

Kept original = 10% × 77 ≈ 7 tokens

Only 77 out of 512 positions generate a loss signal — that's 15% efficiency. CLM generates loss at all 512 positions, making it 6.7x more data-efficient per sequence. This is one reason GPT-style models eventually overtook BERT: they extract more learning from each training example.

Exercise 6.2: CLM vs MLM Efficiency Derive

CLM predicts at all T positions. MLM predicts at ~15% of T positions. For a training corpus of 300B tokens, how many prediction events does each objective produce?

billion MLM predictions

Show derivation

CLM predictions = 300B (one per token position)

MLM predictions = 0.15 × 300B = 45B

CLM/MLM ratio = 300/45 = 6.67× more predictions

CLM gets 6.7x more training signal from the same data. This is misleading though — MLM predictions see bidirectional context, which is arguably a richer learning signal per prediction. The field has debated this tradeoff extensively. Chinchilla (Hoffmann et al., 2022) showed that for CLM, optimal training uses roughly 20 tokens per parameter. For a 7B model, that's ~140B tokens.

Exercise 6.3: Why 80/10/10? Trace

BERT masks 15% of tokens, but of those, only 80% become [MASK] while 10% become random tokens and 10% stay the same. Why not use [MASK] for all 15%?

It's just a regularization trick — there's no deep reason During fine-tuning, [MASK] tokens never appear — the model needs practice predicting from real tokens too. The 10/10 split prevents a train/test distribution mismatch. It reduces computation by 20% The 10% random tokens teach the model to be robust to noise

Show explanation

The core issue is pretrain-finetune mismatch. During fine-tuning, the model never sees [MASK] tokens — all inputs are real words. If the model only learned to predict from [MASK] tokens, it would struggle with real inputs. The 10% "keep original" and 10% "random token" ensure the model also learns to predict from non-[MASK] positions, reducing the distribution shift at fine-tuning time. The BERT paper (Devlin et al., 2019) discusses this in Section 3.1.

Exercise 6.4: Find the Bug Debug

This MLM masking function has a bug. The model trains fine but downstream performance is poor. Click the buggy line.

function applyMLMMask(tokens, maskProb) {
  const labels = tokens.slice();
  for (let i = 0; i < tokens.length; i++) {
    if (Math.random() < maskProb) {
      tokens[i] = MASK_TOKEN;  // always [MASK]
    } else {
      labels[i] = -100;  // ignore in loss
    }
  }
  return { tokens, labels };
}

Show explanation

Line 5 is the bug. All selected positions are replaced with [MASK], ignoring the 80/10/10 split. The correct implementation should: 80% of the time use [MASK], 10% use a random token, and 10% keep the original. Without this split, the model overfits to the [MASK] token distribution and suffers from pretrain-finetune mismatch. The fix:

javascript
const r = Math.random();
if (r < 0.8) tokens[i] = MASK_TOKEN;
else if (r < 0.9) tokens[i] = randomToken();
// else: keep original (10%)

Exercise 6.5: CLM Causal Mask Trace

In CLM, a causal attention mask prevents position t from attending to positions t+1, t+2, ... Why is this necessary?

It reduces computation by half (lower triangular instead of full matrix) It prevents overfitting to long-range dependencies Without it, the model can "cheat" by looking at the token it's supposed to predict — making the task trivial and learning nothing useful. It's a regularization technique, not strictly necessary

Show explanation

If position t can see position t+1, then when predicting w_t+1, the model already has w_t+1 in its context — it just copies the answer instead of learning to predict. The causal mask enforces the autoregressive property: each prediction only uses past tokens. This is also what makes CLM naturally suited for generation: at inference time, the model generates one token at a time, always conditioning only on previous tokens, which exactly matches training.

Chapter 7: Fine-tuning & PEFT

You have a pretrained BERT-base (110M parameters) and want to classify movie reviews as positive or negative. Full fine-tuning updates all 110M parameters — effective but expensive. Parameter-Efficient Fine-Tuning (PEFT) methods freeze most parameters and only train a tiny fraction, achieving similar quality at a fraction of the cost.

Full fine-tuning memory (mixed precision with Adam):
Weights: 2 bytes/param (BF16)
Gradients: 2 bytes/param (BF16)
Adam states: 4 + 4 = 8 bytes/param (FP32 momentum + variance)
Master weights: 4 bytes/param (FP32 copy for optimizer)
Total: ~16 bytes/param

LoRA (Low-Rank Adaptation):
Freeze base weights W [d, d]. Add: W + ΔW = W + BA
A: [d, r], B: [r, d] // r << d
Trainable params per adapted layer: 2 × d × r

LoRA's key insight: weight updates are low-rank. When fine-tuning a large model on a specific task, the actual change to the weight matrices (ΔW) has much lower rank than d. If ΔW is approximately rank-16, then representing it as BA where B:[r,d] and A:[d,r] with r=16 captures the update with 2×d×r parameters instead of d×d. For d=768, that's 24,576 instead of 589,824 — a 24× reduction per layer.

Exercise 7.1: Full Fine-tuning Memory Derive

BERT-base has 110M parameters. Full fine-tuning with mixed-precision Adam requires ~16 bytes/param. How much GPU memory just for weights + optimizer? (Ignore activations.)

Show derivation

Memory = 110 × 10⁶ × 16 bytes = 1,760,000,000 bytes

= 1,760 MB = 1.76 GB

1.76 GB fits easily on any modern GPU. But for a 7B model: 7 × 10⁹ × 16 = 112 GB — that's more than a single A100 80GB. This is why PEFT methods like LoRA are essential for large models. They reduce the 16 bytes/param down to ~2 bytes/param for frozen weights + 16 bytes/param for just the tiny adapter.

Exercise 7.2: LoRA Trainable Parameters Derive

Apply LoRA with rank=16 to Q and V projections of BERT-base (d=768, L=12 layers). Each adapter pair has shapes A:[d,r] + B:[r,d]. How many trainable parameters total?

million trainable params

Show derivation

Per adapter pair: d×r + r×d = 768×16 + 16×768 = 24,576

Per layer (Q + V): 2 × 24,576 = 49,152

All 12 layers: 12 × 49,152 = 589,824

= 0.59M trainable parameters

Ratio to full model: 0.59M / 110M = 0.54%

Only 0.54% of parameters are trainable! The other 99.46% are frozen, requiring only 2 bytes/param (for inference). LoRA memory: frozen weights (110M × 2 = 220 MB) + trainable weights+optimizer (0.59M × 16 = 9.4 MB) = ~230 MB total. Compare to 1.76 GB for full fine-tuning.

Exercise 7.3: Prefix Tuning Derive

Prefix tuning prepends k learnable "virtual tokens" to the key and value sequences at each layer. With k=20 virtual tokens, d=768, L=12 layers. How many trainable parameters?

Each layer gets k virtual key vectors [k, d] and k virtual value vectors [k, d].

million trainable params

Show derivation

Per layer: k × d (keys) + k × d (values) = 2 × 20 × 768 = 30,720

All 12 layers: 12 × 30,720 = 368,640

= 0.37M trainable parameters

Even fewer than LoRA! But prefix tuning has a downside: the k virtual tokens consume positions in the sequence length budget. With k=20 and max sequence 512, you effectively lose 20 positions (3.9%). For long-context tasks, this matters. LoRA doesn't consume any sequence positions.

Exercise 7.4: Compare Approaches Design

Order these fine-tuning methods from most to fewest trainable parameters (for BERT-base, d=768, L=12):

Full fine-tuning (110M) LoRA r=16, Q+V (0.59M) Prefix tuning k=20 (0.37M) Linear head only (0.0015M)

Show explanation

Correct order: Full (110M) > LoRA (0.59M) > Prefix (0.37M) > Head only (0.0015M = 768 × 2 for binary classification).

Performance generally follows the same order, but with diminishing returns: LoRA with 0.5% of parameters typically reaches 95-99% of full fine-tuning quality. The linear head alone can work for simple tasks where the pretrained representations already capture what you need.

Exercise 7.5: Implement LoRA Forward Pass Build

Write a function that computes the LoRA-modified output: y = Wx + BAx, where W is the frozen weight, B and A are low-rank adapter matrices.

Use nested loops for matrix-vector multiplication. W[i][j] * x[j] etc.

Show solution

javascript
function loraForward(x, W, A, B) {
  const dOut = W.length, dIn = x.length, r = A.length;
  // Compute Wx
  const Wx = Array(dOut).fill(0);
  for (let i = 0; i < dOut; i++)
    for (let j = 0; j < dIn; j++)
      Wx[i] += W[i][j] * x[j];
  // Compute Ax (r-dim)
  const Ax = Array(r).fill(0);
  for (let i = 0; i < r; i++)
    for (let j = 0; j < dIn; j++)
      Ax[i] += A[i][j] * x[j];
  // Compute BAx (dOut-dim)
  const BAx = Array(dOut).fill(0);
  for (let i = 0; i < dOut; i++)
    for (let j = 0; j < r; j++)
      BAx[i] += B[i][j] * Ax[j];
  // y = Wx + BAx
  return Wx.map((v, i) => v + BAx[i]);
}

Exercise 7.6: LoRA vs Full FT Memory Derive

For a 7B parameter model: compare memory for (A) full fine-tuning (16 bytes/param) vs (B) LoRA r=16 on Q,V with d=4096, L=32 (frozen base at 2 bytes/param + 16 bytes/param for LoRA weights only). What is the memory ratio B/A?

ratio (B/A)

Show derivation

(A) Full FT: 7B × 16 = 112 GB

(B) LoRA trainable: 2 × 2 × 4096 × 16 × 32 = 8,388,608 ≈ 8.4M params

(B) Memory: 7B × 2 (frozen) + 8.4M × 16 (LoRA optimizer) = 14.0 GB + 0.13 GB ≈ 14.1 GB

Ratio = 14.1 / 112 ≈ 0.126

LoRA uses ~12.5% of the memory of full fine-tuning — making it feasible to fine-tune a 7B model on a single 24GB GPU (14 GB for weights + a few GB for activations). Full fine-tuning would need at least 2x A100 80GB just for the optimizer states.

Chapter 8: Evaluation Metrics

Your model generates text that looks reasonable to you, but you need to convince your team it's actually good. NLP evaluation metrics provide automatic, reproducible scores. Each metric captures a different aspect of quality: BLEU measures precision of n-gram overlap (machine translation), ROUGE measures recall of n-gram overlap (summarization), and F1 measures exact token overlap (extractive QA).

BLEU and ROUGE are opposites. BLEU is precision-based: "of the words the model generated, how many match the reference?" Good for translation where you want the output to be accurate. ROUGE is recall-based: "of the words in the reference, how many does the model capture?" Good for summarization where you want coverage of key information.

Exercise 8.1: BLEU-1 (Unigram Precision) Derive

Reference: "the cat is on the mat". Hypothesis: "the the the the the the". Compute the naive unigram precision (# matching unigrams / # hypothesis unigrams). Then compute the clipped unigram precision (each word's match count is clipped to its count in the reference).

clipped unigram precision

Show derivation

Naive precision: All 6 words in the hypothesis are "the", and "the" appears in the reference. So naive precision = 6/6 = 1.0. This is clearly wrong — the hypothesis is terrible!

Naive: 6 matches / 6 hypothesis words = 1.0 (broken!)

Clipped precision: "the" appears 2 times in the reference, so we clip the match count to 2:

Clipped: min(6, 2) / 6 = 2/6 = 0.333

Clipping prevents gaming by repeating high-frequency words. BLEU always uses clipped counts. This is why BLEU is called "modified n-gram precision" — the modification is this clipping step.

Exercise 8.2: BLEU-2 (Bigram Precision) Derive

Reference: "the cat sat on the mat". Hypothesis: "the cat on the mat". Compute the clipped bigram precision (2-gram precision).

Hypothesis bigrams: (the,cat), (cat,on), (on,the), (the,mat). Reference bigrams: (the,cat), (cat,sat), (sat,on), (on,the), (the,mat).

bigram precision

Show derivation

Hypothesis bigrams: {(the,cat), (cat,on), (on,the), (the,mat)} — 4 bigrams

Reference bigrams: {(the,cat), (cat,sat), (sat,on), (on,the), (the,mat)}

Matches: (the,cat)✓, (cat,on)✗, (on,the)✓, (the,mat)✓

Clipped bigram precision = 3/4 = 0.75

The hypothesis correctly captures 3 of 4 bigrams but misses (cat,sat) because "sat" was omitted. The missing word "sat" breaks one bigram pair. Higher-order n-grams are more sensitive to word order and fluency — BLEU-4 (the standard) uses up to 4-grams.

Exercise 8.3: ROUGE-L (LCS) Derive

Reference: "police killed the gunman". Hypothesis: "police kill the gunman". What is the length of the longest common subsequence (LCS)? Then compute ROUGE-L F1.

LCS does NOT require consecutive tokens — subsequence allows gaps.

ROUGE-L F1

Show derivation

Reference: [police, killed, the, gunman] (4 words)

Hypothesis: [police, kill, the, gunman] (4 words)

LCS: "police", "the", "gunman" = 3 words

("killed" and "kill" don't match as exact tokens)

Precision = LCS / |hyp| = 3/4 = 0.75

Recall = LCS / |ref| = 3/4 = 0.75

F1 = 2 × 0.75 × 0.75 / (0.75 + 0.75) = 0.75

ROUGE-L F1 = 0.75. The score is penalized because "killed" vs "kill" is an exact token mismatch. Some ROUGE variants use stemming to handle this (both stem to "kill"), which would give LCS=4 and perfect F1=1.0. This shows why metric choice matters — stemming, tokenization, and case sensitivity all affect the final score.

Exercise 8.4: Extractive QA F1 Derive

Question: "When was the Eiffel Tower built?" Gold answer: "1887 to 1889". Predicted answer: "built in 1889". Compute the token-level F1 score.

F1 score

Show derivation

Gold tokens: {1887, to, 1889} — 3 tokens

Pred tokens: {built, in, 1889} — 3 tokens

Overlap: {1889} — 1 token

Precision = 1/3 = 0.333

Recall = 1/3 = 0.333

F1 = 2 × 0.333 × 0.333 / (0.333 + 0.333) = 0.333

F1 = 0.33, which seems harsh — the model got the year right but also extracted extra tokens ("built in"). In SQuAD evaluation, exact match (EM) would be 0 (not an exact match) but F1 captures partial credit. A predicted answer of just "1889" would get Precision=1.0, Recall=1/3, F1=0.5 — still penalized for missing "1887 to." The ideal answer is the exact span "1887 to 1889."

Exercise 8.5: Implement BLEU-1 Build

Write a function that computes clipped unigram precision (BLEU-1 without brevity penalty).

Count each word's occurrences in both, clip hypothesis count to reference count.

Show solution

javascript
function bleu1(hypothesis, reference) {
  // Count words in reference
  const refCount = {};
  for (const w of reference)
    refCount[w] = (refCount[w] || 0) + 1;
  // Count clipped matches
  const hypCount = {};
  let clippedMatches = 0;
  for (const w of hypothesis) {
    hypCount[w] = (hypCount[w] || 0) + 1;
  }
  for (const w in hypCount) {
    clippedMatches += Math.min(hypCount[w], refCount[w] || 0);
  }
  return hypothesis.length > 0 ? clippedMatches / hypothesis.length : 0;
}

Exercise 8.6: Brevity Penalty Derive

The BLEU brevity penalty is BP = min(1, exp(1 - r/c)) where r = reference length, c = hypothesis length. If the reference has 20 words and the hypothesis has 12 words, what is BP?

brevity penalty

Show derivation

BP = min(1, exp(1 - r/c)) = min(1, exp(1 - 20/12))

= min(1, exp(1 - 1.667)) = min(1, exp(-0.667))

= min(1, 0.513) = 0.513

The brevity penalty halves the BLEU score! This prevents a strategy of generating short, confident outputs. Without BP, a model could output a single highly-probable word and achieve perfect precision. BP kicks in whenever the hypothesis is shorter than the reference (c < r). When c ≥ r, BP = 1 (no penalty).

Chapter 9: Capstone: Build a Text Classifier

You've been hired to build a sentiment classifier for product reviews. You have 10,000 labeled training examples (positive/negative), a budget of $200 for GPU time, and a deadline of 3 days. Time to make engineering decisions — not every choice involves a novel algorithm, but every choice has cost, quality, and time implications.

This chapter ties everything together. You'll make decisions about tokenization, model selection, training configuration, and cost estimation. There's no single right answer — but there are wrong answers that waste your budget or miss your deadline.

Exercise 9.1: Tokenizer Choice Trace

Your reviews are in English with occasional emoji and product model numbers (e.g., "iPhone 15 Pro Max"). Which tokenization approach should you use?

Word-level tokenization with a 50K vocabulary Character-level tokenization for maximum coverage Use the pretrained tokenizer that comes with BERT/RoBERTa — it's BPE/WordPiece, handles OOV gracefully, and matches what the model was pretrained on Train a custom BPE tokenizer on your 10K reviews

Show reasoning

Always use the tokenizer that matches your pretrained model. BERT's WordPiece and RoBERTa's BPE handle OOV words, model numbers, and most emoji. Training a custom tokenizer on 10K examples would be vastly inferior to the pretrained tokenizer trained on billions of words. Character-level would work but makes sequences 4-5x longer, increasing compute. Word-level would choke on model numbers and rare product names.

Exercise 9.2: Model Selection Trace

For 10K training examples and a binary classification task (sentiment), which approach makes the most sense?

Train GPT-4 scale model from scratch — more parameters = better Fine-tune BERT-base (110M) or RoBERTa-base — right size for the data, great pretrained representations Use a bag-of-words + logistic regression — deep learning is overkill Fine-tune LLaMA 70B with LoRA — bigger is always better

Show reasoning

With 10K examples, BERT-base or RoBERTa-base is the sweet spot. Training from scratch requires millions of examples. A 70B model with LoRA would work but is overkill — the marginal improvement over BERT-base on binary sentiment is minimal, while compute cost is 10-50x higher. Bag-of-words actually works reasonably (85-88%) but BERT easily reaches 93-95% with fine-tuning, and the compute cost is minimal ($1-5 on a single GPU for a few hours).

Exercise 9.3: Fine-tuning FLOPs Derive

Fine-tuning BERT-base (110M params) on 10K examples, sequence length 128, for 3 epochs. Approximate total FLOPs using the rule: FLOPs ≈ 6 × N × D where N = params and D = total tokens processed.

TFLOPs (10¹²)

Show derivation

D = 10,000 examples × 128 tokens × 3 epochs = 3,840,000 tokens

FLOPs = 6 × 110 × 10⁶ × 3,840,000

= 6 × 110 × 10⁶ × 3.84 × 10⁶

= 2,534 × 10¹² ≈ 2,534 TFLOPs

2,534 TFLOPs. An A100 delivers ~312 TFLOPS (BF16 with tensor cores). At 50% utilization: 2,534 / (312 × 0.5) = ~16 seconds of A100 time. This is why BERT fine-tuning is so cheap — even on consumer GPUs (RTX 3090 at 71 TFLOPS), it takes under 2 minutes. The $200 budget is wildly more than needed.

Exercise 9.4: Memory Budget Derive

BERT-base (110M params) with full fine-tuning (16 bytes/param for optimizer). Activation memory for batch_size=32, seq_len=128, d=768, L=12 layers is approximately 32 × 128 × 768 × 12 × 2 bytes. What is the total GPU memory needed (weights+optimizer+activations)?

Show derivation

Weights + optimizer = 110M × 16 = 1,760 MB = 1.76 GB

Activations ≈ 32 × 128 × 768 × 12 × 2 = 75,497,472 bytes ≈ 75 MB

Total ≈ 1.76 GB + 0.075 GB + ~0.7 GB overhead ≈ 2.53 GB

~2.5 GB fits on any GPU made in the last 5 years (even a 4GB GTX 1050 would work with smaller batch size). This is why BERT fine-tuning democratized NLP — you don't need expensive hardware. The real cost is not compute but data labeling: 10,000 labeled examples at $0.10/label = $1,000, far more than the $5 of GPU time.

Exercise 9.5: Design the Pipeline Design

Put these fine-tuning steps in the correct order:

→

Tokenize with pretrained tokenizer Load pretrained BERT + add classification head Fine-tune (3 epochs, lr=2e-5, AdamW) Evaluate on held-out test set Split data: 80% train, 10% val, 10% test

Show explanation

Correct order: Split data → Tokenize → Load pretrained model → Fine-tune → Evaluate

You must split before tokenizing to prevent data leakage. The model is loaded after tokenization because you need the tokenizer to prepare inputs. Fine-tuning uses the validation set for early stopping. Final evaluation on the test set happens only once — if you tune hyperparameters based on test performance, you're overfitting to the test set.

Exercise 9.6: Cost Estimation Derive

An A100 on AWS costs ~$3.00/hour (on-demand). Your fine-tuning takes ~16 seconds (from Exercise 9.3). Including hyperparameter search (5 runs) and evaluation, estimate total GPU hours and cost.

dollars (USD)

Show derivation

5 training runs × 16 seconds = 80 seconds

Evaluation overhead ≈ 20 seconds

Data loading/preprocessing ≈ 60 seconds

Total GPU time ≈ 160 seconds ≈ 0.044 hours

Cost = 0.044 × $3.00 = $0.13

Under $1 total! With the $200 budget, you could run 1,500 experiments. In practice, you'd spend more on the minimum billing increment (most cloud providers charge per-minute or per-second with a minimum). The real lesson: BERT-base fine-tuning on small datasets is essentially free. The engineering time to set up the pipeline (a few hours of your salary) vastly exceeds the compute cost.

Exercise 9.7: Find the Bug Debug

Your colleague's training code produces suspiciously high accuracy (99.8%) on the test set. Click the buggy line.

data = load_reviews('reviews.csv')
tokenized = tokenize(data)
model = BertForSequenceClassification.from_pretrained('bert-base')
model.fit(tokenized, epochs=3, lr=2e-5)
accuracy = model.evaluate(tokenized)  # eval on training data!
print(f'Test accuracy: {accuracy}')

Show explanation

Line 5 is the bug. The code evaluates on the same tokenized data used for training — there's no train/test split! The model memorized the training examples and the 99.8% accuracy is meaningless. The fix: split data into train/val/test before training, train on train, use val for early stopping, and evaluate on test only once at the end. Without this split, you have no idea how the model performs on unseen data.

NLP Fundamentals

Chapter 0: Word Vectors

Chapter 1: Word2Vec Gradients

Chapter 2: Dependency Parsing

Chapter 3: Language Model Perplexity

Chapter 4: Attention Mechanisms

Chapter 5: Subword Tokenization

Chapter 6: Pretraining Objectives

Chapter 7: Fine-tuning & PEFT

Chapter 8: Evaluation Metrics

Chapter 9: Capstone: Build a Text Classifier