From word vectors to text classifiers — the core NLP toolkit derived by hand. Word2Vec, parsing, attention, tokenization, pretraining, fine-tuning, and evaluation — all solvable in-browser with instant feedback.
How do you tell a computer that "king" and "queen" are related but "king" and "toaster" are not? You can't just alphabetize — "cat" and "car" are close alphabetically but worlds apart semantically. The insight behind word vectors is to represent each word as a point in a high-dimensional space where distance equals meaning.
In Word2Vec (skipgram), you train a model to predict context words from a center word. The key idea: if two words appear in similar contexts, they end up with similar vectors. The training window defines "context" — for window size w, context words are the w words before and after the center word.
With window=2, we take 2 words to the left and 2 words to the right of the center word "sat" (position 3).
Note: the center word is never included in its own context. Also note "the" appears twice — once from the left, once from the right. In practice Word2Vec treats these as separate training pairs: (sat, the), (sat, cat), (sat, on), (sat, the).
Given vectors: vcat = [0.5, 0.8, -0.2], vdog = [0.4, 0.9, -0.1]. Compute the dot product vcat · vdog.
A positive dot product means the vectors point in roughly the same direction — the words are semantically similar. Compare this to vcat · vtoaster which would be near zero for unrelated words.
Using the same vectors: vcat = [0.5, 0.8, -0.2], vdog = [0.4, 0.9, -0.1]. Compute the cosine similarity (round to 2 decimal places).
Compute ||vcat|| = √(0.25 + 0.64 + 0.04), ||vdog|| = √(0.16 + 0.81 + 0.01), then divide dot product by the product of norms.
Cosine similarity of 0.98 — extremely similar! Cosine similarity normalizes for vector length, so it measures direction only. Two vectors with cosine similarity = 1.0 point the same way regardless of magnitude. This is why NLP prefers cosine over raw dot product for comparing embeddings.
Write a function that takes two arrays (vectors) of equal length and returns their cosine similarity.
javascript function cosineSimilarity(a, b) { let dot = 0, normA = 0, normB = 0; for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; normA += a[i] * a[i]; normB += b[i] * b[i]; } return dot / (Math.sqrt(normA) * Math.sqrt(normB)); }
Given: vking = [0.8, 0.6, 0.9], vman = [0.3, 0.1, 0.7], vwoman = [0.3, 0.5, 0.7]. Compute vresult = vking - vman + vwoman. What is the first component of vresult?
The result [0.8, 1.0, 0.9] should be closest to vqueen in the embedding space. Notice the first component stayed at 0.8 (the "royalty" axis), while the second component jumped from 0.6 to 1.0 (the "femininity" axis). This is the geometric magic of word vector arithmetic: king - man + woman ≈ queen.
Sentence: "I love deep learning models". Window size = 2. How many total (center, context) training pairs does Word2Vec skipgram generate?
Count each (center, context) pair. Edge words have fewer context words since the window gets clipped at sentence boundaries.
For each word, count how many context words are within the window:
"love" gets only 3 context words (not 4) because the window clips at the left boundary — there's only 1 word to its left. Edge effects reduce total pairs. The general formula: for a sentence of length n with window m, each word at position i produces min(m, i-1) + min(m, n-i) pairs.
Word2Vec learns by gradient descent, so we need to understand the objective function and its gradients. The skipgram model defines the probability of observing a context word o given center word c using the softmax over the entire vocabulary:
Vocabulary: {a, b, c, d, e}. Center word "c" with vc = [1, 0]. Context vectors: ua=[1,1], ub=[0,1], uc=[1,0], ud=[-1,0], ue=[0,-1]. Compute P(a | c).
First compute all dot products uw · vc, then apply softmax.
Words with higher dot product to the center word get higher probability. ua and uc both have dot product 1, so they share the highest probability. ud has dot product -1, so it gets the lowest probability (0.047). This is exactly what we want: the model says "a" and "c" are likely context words for center word "c".
The denominator ∑w∈V exp(uw · vc) requires computing the dot product of vc with every context vector uw in the vocabulary. For V = 50,000, that's 50,000 dot products per training pair. With billions of training pairs, this is prohibitively expensive. This is the softmax bottleneck — and it's exactly why negative sampling was invented.
Negative sampling replaces the full softmax with a binary classification. Instead of normalizing over V words, we maximize log σ(uo · vc) + ∑k log σ(-uk · vc) for k negative samples. Given: uo · vc = 2.0, and 2 negative samples with uk1 · vc = -0.5, uk2 · vc = 0.3. Compute the negative sampling loss JNS = -[log σ(2.0) + log σ(0.5) + log σ(-0.3)].
σ(x) = 1/(1+e-x). So σ(2.0)≈0.881, σ(0.5)≈0.622, σ(-0.3)≈0.426.
Positive term: σ(uo · vc) = σ(2.0). Negative terms use σ(-uk · vc):
The key subtlety: negative samples use σ(-uk · vc), which encourages high sigmoid output when the dot product is negative (i.e., the words are dissimilar). With just k=2 negatives instead of V=50,000 terms in the softmax denominator, training is orders of magnitude faster.
Using the 5-word vocabulary from Exercise 1.1, with the observed context word being "a": compute the gradient ∂J/∂vc = -uo + ∑w P(w|c) · uw. What is the first component of the gradient vector?
Use the probabilities: P(a|c)=0.348, P(b|c)=0.128, P(c|c)=0.348, P(d|c)=0.047, P(e|c)=0.128. The expected vector E[uw] = ∑ P(w|c) · uw.
First compute E[uw] (first component only):
The gradient is negative in the first component, meaning the update will increase vc in that direction — pushing vc toward the observed context word ua. The gradient literally says: "move the center vector closer to the word you actually observed."
This is why negative sampling made Word2Vec practical. The original paper (Mikolov et al., 2013) reported that k=5-20 negative samples works well for small datasets, while k=2-5 suffices for very large corpora. The quality barely degrades because most words in the vocabulary are irrelevant to any given center word — you only need a small sample of negatives to approximate the full gradient.
This negative sampling loss implementation has a subtle bug. Click the buggy line.
function negSamplingLoss(vc, uo, negVecs) { let loss = -Math.log(sigmoid(dot(uo, vc))); for (let k = 0; k < negVecs.length; k++) { loss -= Math.log(sigmoid(dot(negVecs[k], vc))); } return loss; }
Line 4 is the bug. For negative samples, we want σ(-uk · vc), not σ(uk · vc). The correct line should be: loss -= Math.log(sigmoid(-dot(negVecs[k], vc)));
Without the negation, the loss encourages high similarity with negative samples instead of discouraging it. The model would learn to make all words similar — the exact opposite of what we want.
How does a computer figure out that in "the cat sat on the mat," the word "sat" is the main action, "cat" is the one doing it, and "mat" is where it happened? Dependency parsing builds a tree that connects each word to the word it modifies. The result is a directed graph where every word except the root has exactly one parent.
The shift-reduce parser uses three operations to build this tree incrementally. It maintains a stack (words being processed), a buffer (words waiting), and a set of arcs (the dependency edges found so far).
Each SHIFT moves the front word from the buffer onto the top of the stack. The stack grows from left to right — the rightmost element is the "top."
LEFT-ARC: the top of stack (cat) becomes the parent of the second-on-stack (The). The second-on-stack (The) is removed. The arc is cat → The (parent → child), labeled "det" for determiner.
Think of LEFT-ARC as: "the word below me on the stack depends on me." The arrow points from parent to child: cat → The. This correctly says "The" is a determiner of "cat."
"The cat sat on the mat" has 6 words. How many total transitions (SHIFT + ARC operations) does the shift-reduce parser need?
Every word must be shifted once (6 shifts) and every word must be attached to exactly one parent (6 arcs, one for each word including ROOT attachment). This gives exactly 2n transitions. A neural parser with a scoring network that runs in O(1) per transition thus parses in O(n) total time — linear in sentence length.
Put these transitions in the correct order to parse "The cat sat":
Correct order: SHIFT "The" → SHIFT "cat" → LEFT-ARC(det) → SHIFT "sat" → LEFT-ARC(nsubj) → RIGHT-ARC(root)
The final tree: ROOT → sat, sat → cat (nsubj), cat → The (det). This correctly captures that "sat" is the main verb, "cat" is the subject, and "The" modifies "cat."
Shift-reduce: O(n) = 80 steps. Graph-based (Eisner/MST): O(n²) or O(n³) = 1,600 or 64,000 steps. Shift-reduce wins on speed, but graph-based can find the globally optimal tree while shift-reduce makes greedy local decisions. In practice, beam search (keeping top-k candidates at each step) helps shift-reduce approach global optimality while staying fast.
You've trained two language models and both look reasonable. How do you decide which one is better? You can't just eyeball their generated text — you need a number. Perplexity is that number. It measures how "surprised" the model is by real text. A model that assigns high probability to actual sentences has low perplexity.
A language model assigns these probabilities to a 5-word sentence: P(w1)=0.1, P(w2|w1)=0.3, P(w3|w1:2)=0.5, P(w4|w1:3)=0.2, P(w5|w1:4)=0.4. Compute the perplexity.
Alternative computation via the product formula: (0.1 × 0.3 × 0.5 × 0.2 × 0.4)-1/5 = (0.0012)-0.2 = (1/0.0012)0.2 = 833.30.2 ≈ 3.84. Both methods give the same answer. The model is about as confused as picking from ~4 equally likely options.
Your language model reports a validation cross-entropy loss of 2.8 (nats, i.e. natural log base). What is the perplexity?
This is the most common conversion you'll do in practice. Training loss of 2.8 → perplexity of ~16.4. For reference: GPT-2 achieved ~29.4 perplexity on WikiText-103, GPT-3 improved to ~20.5, and current frontier models are in the 5-10 range on similar benchmarks.
Lower perplexity = better model. Model B (PP=15.3) assigns higher probability to the test text than Model A (PP=22.1). Concretely: at each word position, Model B is as uncertain as choosing from ~15 options, while Model A is as uncertain as choosing from ~22 options. Model B has a tighter, more accurate distribution over next words.
Caveat: perplexity must be compared on the same test set with the same vocabulary. A model with vocabulary size 1000 will naturally have lower perplexity than one with vocabulary 50000 on the same text, simply because it has fewer choices.
Write a function that takes an array of token probabilities and returns the perplexity.
javascript function perplexity(probs) { const N = probs.length; let sumLogP = 0; for (let i = 0; i < N; i++) { sumLogP += Math.log(probs[i]); } return Math.exp(-sumLogP / N); }
A uniform language model assigns probability 1/V to every word regardless of context, where V is the vocabulary size. If V = 50,000, what is the perplexity of this model?
The uniform model has perplexity exactly equal to the vocabulary size. This is the worst-case baseline — any model that learns anything about language should beat this. For V=50,000, the model is maximally confused, picking randomly from all 50,000 words at each position. A perplexity of 15 means the model has reduced its effective choices from 50,000 to 15 — a 3,333× improvement.
Before transformers, seq-to-seq models compressed the entire input sentence into a single fixed-length vector. For long sentences, this bottleneck destroyed information. Attention (Bahdanau et al., 2014) lets the decoder look back at all encoder states and focus on the most relevant ones for each output word.
Three encoder states (source words): h1=[1,0], h2=[0,1], h3=[1,1]. One decoder state: s=[0.5, 0.8]. Using dot-product attention (Luong, no W matrix), compute the alignment score e1 = s · h1.
Using the same vectors, compute all three alignment scores and then the attention weights α1, α2, α3 via softmax. What is α3 (the weight on h3=[1,1])?
h3 gets the highest attention weight (~49%) because it has the largest dot product with the decoder state. This makes sense: h3=[1,1] is the most aligned with s=[0.5,0.8] — it's the source word the decoder "wants" to look at most.
Using the attention weights from 4.2 (α1≈0.219, α2≈0.295, α3≈0.486), compute the context vector c = ∑ αjhj. What is the first component of c?
The context vector is a weighted average of the encoder states. Since h3 has the highest weight, c is pulled strongly toward h3=[1,1]. This vector summarizes "what the decoder should know about the input" for generating the current output word.
Bahdanau uses a tanh nonlinearity, so it can capture non-linear relationships between query and key. But this prevents efficient batching — you can't compute all scores with a single matrix multiply. Luong's dot product can be computed as S × HT, a single GEMM call that GPUs love. This is why transformers chose scaled dot-product attention: QKT/√dk is just Luong attention with a scaling factor to prevent softmax saturation.
"chat" is the French word for "cat," so it should receive the highest attention weight when generating "cat." This is exactly what attention was designed for: it learns to align source and target words. In French-to-English translation, attention handles word reordering naturally — "Le chat noir" becomes "The black cat" by attending to the right word at each step, regardless of position differences between the languages.
Transformers use scaled dot-product: score = q · k / √dk. If dk = 64 and q · k = 16, what is the scaled score?
Without scaling, dot products grow proportionally to dk (expected magnitude ≈ √dk for random vectors). Large dot products push softmax into saturated regions where gradients vanish. Dividing by √dk keeps the variance at 1, regardless of dimension. The original "Attention Is All You Need" paper (Vaswani et al., 2017) explains this in Section 3.2.1.
Word-level tokenization has a fatal flaw: what happens when the model encounters a word it's never seen? Out-of-vocabulary (OOV) words are replaced with a generic [UNK] token, destroying meaning. "Unforgettable" becomes [UNK]. The solution is subword tokenization: break words into smaller pieces that cover all possible words.
Byte-Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent pair. After training, the vocabulary contains a mix of characters, subwords ("un", "##ing"), and full words ("the").
BPE always merges the most frequent pair. (a,b) appears 8 times — more than any other pair. After merging, the new token "ab" replaces all occurrences of "a b" in the corpus. The vocabulary becomes {a, b, c, d, _, ab} with 6 tokens.
You start with a character vocabulary of 256 (all bytes). After performing 32,000 merge operations, what is the final vocabulary size?
Each merge creates exactly one new token (the concatenation of the merged pair) without removing the original characters. So the vocabulary grows by exactly 1 per merge. This is how you choose your target vocabulary size: set k = desired_V - initial_chars. LLaMA's 32K vocabulary used approximately 32,000 - 256 = 31,744 merges.
Corpus: "low lower lowest" with word frequencies: "low"=5, "lower"=2, "lowest"=1. Initial tokens (with end-of-word ¤): l o w ¤ (freq 8), e r ¤ (freq 2), e s t ¤ (freq 1). After 3 merge operations, how many tokens are in the vocabulary? Start from initial character set: {l, o, w, e, r, s, t, ¤} = 8 tokens.
Count pair frequencies across all words (weighted by word frequency):
BPE's key advantage is that it cannot produce [UNK]. Any string can be decomposed into known subwords — in the worst case, into individual bytes. The model can still compose meaning from pieces: "Chat" (conversation), "GPT" (common in AI context), "4" (number), "o" (suffix). Word-level models are helpless with OOV words, making them brittle against new terminology, typos, and morphological variation.
Compare the embedding table size for: (A) word-level with V=100,000, d=768 vs (B) BPE with V=32,000, d=768. How many fewer parameters does BPE use (in millions)?
52M parameters saved just in the embedding table! For GPT-2 Small (124M total), this is ~42% of the entire model. Smaller vocabulary also speeds up the output softmax layer. The tradeoff: BPE sequences are ~20-30% longer than word-level (e.g., "unforgettable" = 1 word-token vs 3 BPE tokens), increasing compute per sentence.
The magic of modern NLP is pretraining: train a huge model on a vast corpus with a self-supervised objective, then fine-tune on your specific task. The two dominant objectives are Masked Language Modeling (MLM) used by BERT and Causal Language Modeling (CLM) used by GPT.
BERT processes a sequence of 512 tokens with 15% masking. How many tokens does the model predict per sequence? And of those, how many see [MASK] vs random token vs original?
Only 77 out of 512 positions generate a loss signal — that's 15% efficiency. CLM generates loss at all 512 positions, making it 6.7x more data-efficient per sequence. This is one reason GPT-style models eventually overtook BERT: they extract more learning from each training example.
CLM predicts at all T positions. MLM predicts at ~15% of T positions. For a training corpus of 300B tokens, how many prediction events does each objective produce?
CLM gets 6.7x more training signal from the same data. This is misleading though — MLM predictions see bidirectional context, which is arguably a richer learning signal per prediction. The field has debated this tradeoff extensively. Chinchilla (Hoffmann et al., 2022) showed that for CLM, optimal training uses roughly 20 tokens per parameter. For a 7B model, that's ~140B tokens.
The core issue is pretrain-finetune mismatch. During fine-tuning, the model never sees [MASK] tokens — all inputs are real words. If the model only learned to predict from [MASK] tokens, it would struggle with real inputs. The 10% "keep original" and 10% "random token" ensure the model also learns to predict from non-[MASK] positions, reducing the distribution shift at fine-tuning time. The BERT paper (Devlin et al., 2019) discusses this in Section 3.1.
This MLM masking function has a bug. The model trains fine but downstream performance is poor. Click the buggy line.
function applyMLMMask(tokens, maskProb) { const labels = tokens.slice(); for (let i = 0; i < tokens.length; i++) { if (Math.random() < maskProb) { tokens[i] = MASK_TOKEN; // always [MASK] } else { labels[i] = -100; // ignore in loss } } return { tokens, labels }; }
Line 5 is the bug. All selected positions are replaced with [MASK], ignoring the 80/10/10 split. The correct implementation should: 80% of the time use [MASK], 10% use a random token, and 10% keep the original. Without this split, the model overfits to the [MASK] token distribution and suffers from pretrain-finetune mismatch. The fix:
javascript const r = Math.random(); if (r < 0.8) tokens[i] = MASK_TOKEN; else if (r < 0.9) tokens[i] = randomToken(); // else: keep original (10%)
If position t can see position t+1, then when predicting wt+1, the model already has wt+1 in its context — it just copies the answer instead of learning to predict. The causal mask enforces the autoregressive property: each prediction only uses past tokens. This is also what makes CLM naturally suited for generation: at inference time, the model generates one token at a time, always conditioning only on previous tokens, which exactly matches training.
You have a pretrained BERT-base (110M parameters) and want to classify movie reviews as positive or negative. Full fine-tuning updates all 110M parameters — effective but expensive. Parameter-Efficient Fine-Tuning (PEFT) methods freeze most parameters and only train a tiny fraction, achieving similar quality at a fraction of the cost.
BERT-base has 110M parameters. Full fine-tuning with mixed-precision Adam requires ~16 bytes/param. How much GPU memory just for weights + optimizer? (Ignore activations.)
1.76 GB fits easily on any modern GPU. But for a 7B model: 7 × 109 × 16 = 112 GB — that's more than a single A100 80GB. This is why PEFT methods like LoRA are essential for large models. They reduce the 16 bytes/param down to ~2 bytes/param for frozen weights + 16 bytes/param for just the tiny adapter.
Apply LoRA with rank=16 to Q and V projections of BERT-base (d=768, L=12 layers). Each adapter pair has shapes A:[d,r] + B:[r,d]. How many trainable parameters total?
Only 0.54% of parameters are trainable! The other 99.46% are frozen, requiring only 2 bytes/param (for inference). LoRA memory: frozen weights (110M × 2 = 220 MB) + trainable weights+optimizer (0.59M × 16 = 9.4 MB) = ~230 MB total. Compare to 1.76 GB for full fine-tuning.
Prefix tuning prepends k learnable "virtual tokens" to the key and value sequences at each layer. With k=20 virtual tokens, d=768, L=12 layers. How many trainable parameters?
Each layer gets k virtual key vectors [k, d] and k virtual value vectors [k, d].
Even fewer than LoRA! But prefix tuning has a downside: the k virtual tokens consume positions in the sequence length budget. With k=20 and max sequence 512, you effectively lose 20 positions (3.9%). For long-context tasks, this matters. LoRA doesn't consume any sequence positions.
Order these fine-tuning methods from most to fewest trainable parameters (for BERT-base, d=768, L=12):
Correct order: Full (110M) > LoRA (0.59M) > Prefix (0.37M) > Head only (0.0015M = 768 × 2 for binary classification).
Performance generally follows the same order, but with diminishing returns: LoRA with 0.5% of parameters typically reaches 95-99% of full fine-tuning quality. The linear head alone can work for simple tasks where the pretrained representations already capture what you need.
Write a function that computes the LoRA-modified output: y = Wx + BAx, where W is the frozen weight, B and A are low-rank adapter matrices.
javascript function loraForward(x, W, A, B) { const dOut = W.length, dIn = x.length, r = A.length; // Compute Wx const Wx = Array(dOut).fill(0); for (let i = 0; i < dOut; i++) for (let j = 0; j < dIn; j++) Wx[i] += W[i][j] * x[j]; // Compute Ax (r-dim) const Ax = Array(r).fill(0); for (let i = 0; i < r; i++) for (let j = 0; j < dIn; j++) Ax[i] += A[i][j] * x[j]; // Compute BAx (dOut-dim) const BAx = Array(dOut).fill(0); for (let i = 0; i < dOut; i++) for (let j = 0; j < r; j++) BAx[i] += B[i][j] * Ax[j]; // y = Wx + BAx return Wx.map((v, i) => v + BAx[i]); }
For a 7B parameter model: compare memory for (A) full fine-tuning (16 bytes/param) vs (B) LoRA r=16 on Q,V with d=4096, L=32 (frozen base at 2 bytes/param + 16 bytes/param for LoRA weights only). What is the memory ratio B/A?
LoRA uses ~12.5% of the memory of full fine-tuning — making it feasible to fine-tune a 7B model on a single 24GB GPU (14 GB for weights + a few GB for activations). Full fine-tuning would need at least 2x A100 80GB just for the optimizer states.
Your model generates text that looks reasonable to you, but you need to convince your team it's actually good. NLP evaluation metrics provide automatic, reproducible scores. Each metric captures a different aspect of quality: BLEU measures precision of n-gram overlap (machine translation), ROUGE measures recall of n-gram overlap (summarization), and F1 measures exact token overlap (extractive QA).
Reference: "the cat is on the mat". Hypothesis: "the the the the the the". Compute the naive unigram precision (# matching unigrams / # hypothesis unigrams). Then compute the clipped unigram precision (each word's match count is clipped to its count in the reference).
Naive precision: All 6 words in the hypothesis are "the", and "the" appears in the reference. So naive precision = 6/6 = 1.0. This is clearly wrong — the hypothesis is terrible!
Clipped precision: "the" appears 2 times in the reference, so we clip the match count to 2:
Clipping prevents gaming by repeating high-frequency words. BLEU always uses clipped counts. This is why BLEU is called "modified n-gram precision" — the modification is this clipping step.
Reference: "the cat sat on the mat". Hypothesis: "the cat on the mat". Compute the clipped bigram precision (2-gram precision).
Hypothesis bigrams: (the,cat), (cat,on), (on,the), (the,mat). Reference bigrams: (the,cat), (cat,sat), (sat,on), (on,the), (the,mat).
The hypothesis correctly captures 3 of 4 bigrams but misses (cat,sat) because "sat" was omitted. The missing word "sat" breaks one bigram pair. Higher-order n-grams are more sensitive to word order and fluency — BLEU-4 (the standard) uses up to 4-grams.
Reference: "police killed the gunman". Hypothesis: "police kill the gunman". What is the length of the longest common subsequence (LCS)? Then compute ROUGE-L F1.
LCS does NOT require consecutive tokens — subsequence allows gaps.
ROUGE-L F1 = 0.75. The score is penalized because "killed" vs "kill" is an exact token mismatch. Some ROUGE variants use stemming to handle this (both stem to "kill"), which would give LCS=4 and perfect F1=1.0. This shows why metric choice matters — stemming, tokenization, and case sensitivity all affect the final score.
Question: "When was the Eiffel Tower built?" Gold answer: "1887 to 1889". Predicted answer: "built in 1889". Compute the token-level F1 score.
F1 = 0.33, which seems harsh — the model got the year right but also extracted extra tokens ("built in"). In SQuAD evaluation, exact match (EM) would be 0 (not an exact match) but F1 captures partial credit. A predicted answer of just "1889" would get Precision=1.0, Recall=1/3, F1=0.5 — still penalized for missing "1887 to." The ideal answer is the exact span "1887 to 1889."
Write a function that computes clipped unigram precision (BLEU-1 without brevity penalty).
javascript function bleu1(hypothesis, reference) { // Count words in reference const refCount = {}; for (const w of reference) refCount[w] = (refCount[w] || 0) + 1; // Count clipped matches const hypCount = {}; let clippedMatches = 0; for (const w of hypothesis) { hypCount[w] = (hypCount[w] || 0) + 1; } for (const w in hypCount) { clippedMatches += Math.min(hypCount[w], refCount[w] || 0); } return hypothesis.length > 0 ? clippedMatches / hypothesis.length : 0; }
The BLEU brevity penalty is BP = min(1, exp(1 - r/c)) where r = reference length, c = hypothesis length. If the reference has 20 words and the hypothesis has 12 words, what is BP?
The brevity penalty halves the BLEU score! This prevents a strategy of generating short, confident outputs. Without BP, a model could output a single highly-probable word and achieve perfect precision. BP kicks in whenever the hypothesis is shorter than the reference (c < r). When c ≥ r, BP = 1 (no penalty).
You've been hired to build a sentiment classifier for product reviews. You have 10,000 labeled training examples (positive/negative), a budget of $200 for GPU time, and a deadline of 3 days. Time to make engineering decisions — not every choice involves a novel algorithm, but every choice has cost, quality, and time implications.
Always use the tokenizer that matches your pretrained model. BERT's WordPiece and RoBERTa's BPE handle OOV words, model numbers, and most emoji. Training a custom tokenizer on 10K examples would be vastly inferior to the pretrained tokenizer trained on billions of words. Character-level would work but makes sequences 4-5x longer, increasing compute. Word-level would choke on model numbers and rare product names.
With 10K examples, BERT-base or RoBERTa-base is the sweet spot. Training from scratch requires millions of examples. A 70B model with LoRA would work but is overkill — the marginal improvement over BERT-base on binary sentiment is minimal, while compute cost is 10-50x higher. Bag-of-words actually works reasonably (85-88%) but BERT easily reaches 93-95% with fine-tuning, and the compute cost is minimal ($1-5 on a single GPU for a few hours).
Fine-tuning BERT-base (110M params) on 10K examples, sequence length 128, for 3 epochs. Approximate total FLOPs using the rule: FLOPs ≈ 6 × N × D where N = params and D = total tokens processed.
2,534 TFLOPs. An A100 delivers ~312 TFLOPS (BF16 with tensor cores). At 50% utilization: 2,534 / (312 × 0.5) = ~16 seconds of A100 time. This is why BERT fine-tuning is so cheap — even on consumer GPUs (RTX 3090 at 71 TFLOPS), it takes under 2 minutes. The $200 budget is wildly more than needed.
BERT-base (110M params) with full fine-tuning (16 bytes/param for optimizer). Activation memory for batch_size=32, seq_len=128, d=768, L=12 layers is approximately 32 × 128 × 768 × 12 × 2 bytes. What is the total GPU memory needed (weights+optimizer+activations)?
~2.5 GB fits on any GPU made in the last 5 years (even a 4GB GTX 1050 would work with smaller batch size). This is why BERT fine-tuning democratized NLP — you don't need expensive hardware. The real cost is not compute but data labeling: 10,000 labeled examples at $0.10/label = $1,000, far more than the $5 of GPU time.
Put these fine-tuning steps in the correct order:
Correct order: Split data → Tokenize → Load pretrained model → Fine-tune → Evaluate
You must split before tokenizing to prevent data leakage. The model is loaded after tokenization because you need the tokenizer to prepare inputs. Fine-tuning uses the validation set for early stopping. Final evaluation on the test set happens only once — if you tune hyperparameters based on test performance, you're overfitting to the test set.
An A100 on AWS costs ~$3.00/hour (on-demand). Your fine-tuning takes ~16 seconds (from Exercise 9.3). Including hyperparameter search (5 runs) and evaluation, estimate total GPU hours and cost.
Under $1 total! With the $200 budget, you could run 1,500 experiments. In practice, you'd spend more on the minimum billing increment (most cloud providers charge per-minute or per-second with a minimum). The real lesson: BERT-base fine-tuning on small datasets is essentially free. The engineering time to set up the pipeline (a few hours of your salary) vastly exceeds the compute cost.
Your colleague's training code produces suspiciously high accuracy (99.8%) on the test set. Click the buggy line.
data = load_reviews('reviews.csv') tokenized = tokenize(data) model = BertForSequenceClassification.from_pretrained('bert-base') model.fit(tokenized, epochs=3, lr=2e-5) accuracy = model.evaluate(tokenized) # eval on training data! print(f'Test accuracy: {accuracy}')
Line 5 is the bug. The code evaluates on the same tokenized data used for training — there's no train/test split! The model memorized the training examples and the 99.8% accuracy is meaningless. The fix: split data into train/val/test before training, train on train, use val for early stopping, and evaluate on test only once at the end. Without this split, you have no idea how the model performs on unseen data.