Code from Scratch Workbook — Engineermaxxing

Chapter 0: Softmax

Every transformer starts here. You have a vector of raw scores — logits — and you need to turn them into a probability distribution. That's softmax: exponentiate each value, then divide by the sum. Sounds trivial. Until your logits are [1000, 1001, 999] and Math.exp(1000) returns Infinity.

The trick is numerical stability: subtract the max before exponentiating. The math is identical (it factors out as a constant in numerator and denominator), but the largest exponent becomes e⁰ = 1 instead of e¹⁰⁰⁰ = boom.

softmax(x_i) = exp(x_i − max(x)) / ∑_j exp(x_j − max(x))

Why subtract the max? If the largest logit is M, then after subtraction the largest exponent is e⁰ = 1. All others are e^negative < 1. No overflow possible. The output is mathematically identical because exp(x_i − M) / ∑ exp(x_j − M) = exp(x_i) · e^−M / (∑ exp(x_j) · e^−M) = exp(x_i) / ∑ exp(x_j).

Exercise 0.1: Implement softmax Build

Write a numerically stable softmax. Input: array of logits. Output: array of probabilities that sum to 1.

Must handle large values without overflow. Output should sum to 1.0.

Show solution

javascript
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

Exercise 0.2: Implement logSoftmax Build

Log-softmax is used everywhere in loss computation because log(softmax(x)) computed naively loses precision. Instead, compute it directly: log_softmax(x_i) = (x_i − max) − log(∑ exp(x_j − max)).

Output values should all be ≤ 0. exp(logSoftmax(x)) should equal softmax(x).

Show solution

javascript
function logSoftmax(logits) {
  const max = Math.max(...logits);
  const shifted = logits.map(x => x - max);
  const logSumExp = Math.log(shifted.reduce((a, x) => a + Math.exp(x), 0));
  return shifted.map(x => x - logSumExp);
}

Exercise 0.3: Why Subtract the Max? Trace

If you compute softmax([1000, 1001, 999]) without subtracting the max first, what happens?

You get slightly wrong probabilities due to floating-point rounding exp(1000) = Infinity, so numerator and denominator are both Infinity, giving NaN JavaScript throws an overflow exception The probabilities sum to more than 1.0

Exercise 0.4: Find the Bug Debug

This softmax implementation has a subtle bug. It works for small values but overflows for large logits. Click the buggy line.

function softmax(logits) {
  const exps = logits.map(x => Math.exp(x));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

Show explanation

Line 2 is the bug. It exponentiates raw logits without subtracting the max first. For logits like [1000, 1001, 999], Math.exp(1000) returns Infinity. The fix: compute const max = Math.max(...logits) and use Math.exp(x - max).

Exercise 0.5: Softmax Temperature Build

Implement softmax with temperature: divide logits by T before applying softmax. T>1 makes the distribution more uniform (softer). T<1 makes it peakier (sharper). T→0 approaches argmax.

Show solution

javascript
function softmaxTemp(logits, T) {
  const scaled = logits.map(x => x / T);
  const max = Math.max(...scaled);
  const exps = scaled.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

Chapter 1: Linear Layer

The linear layer is the workhorse of neural networks. It's just a matrix multiply plus a bias: y = xW + b. Every projection in a transformer — Q, K, V, output, FFN up, FFN down — is a linear layer. If you can implement matmul, you can implement everything else.

We work with 2D arrays where x[i][j] means row i, column j. Matrix multiply of A [m×n] and B [n×p] produces C [m×p] where C[i][j] = ∑_k A[i][k] · B[k][j].

A [m × n] · B [n × p] = C [m × p]
C_ij = ∑_k=0ⁿ⁻¹ A_ik · B_kj

Exercise 1.1: Matrix Multiply Build

Implement general matrix multiplication. A is [m, n], B is [n, p]. Return C = A · B which is [m, p].

Show solution

javascript
function matmul(A, B) {
  const m = A.length, n = A[0].length, p = B[0].length;
  const C = Array.from({length: m}, () => Array(p).fill(0));
  for (let i = 0; i < m; i++)
    for (let j = 0; j < p; j++)
      for (let k = 0; k < n; k++)
        C[i][j] += A[i][k] * B[k][j];
  return C;
}

Exercise 1.2: Linear Layer Forward Build

Implement y = xW + b. x is [batch, d_in], W is [d_in, d_out], b is [d_out]. Output is [batch, d_out]. You can call your matmul from 1.1 inside this function.

Show solution

javascript
function linear(x, W, b) {
  const batch = x.length, dOut = W[0].length, dIn = W.length;
  const out = Array.from({length: batch}, () => Array(dOut).fill(0));
  for (let i = 0; i < batch; i++)
    for (let j = 0; j < dOut; j++) {
      let s = b[j];
      for (let k = 0; k < dIn; k++) s += x[i][k] * W[k][j];
      out[i][j] = s;
    }
  return out;
}

Exercise 1.3: Transpose Build

Implement matrix transpose. You'll need this for backprop and attention. Input [m, n], output [n, m].

Show solution

javascript
function transpose(A) {
  const m = A.length, n = A[0].length;
  return Array.from({length: n}, (_, j) =>
    Array.from({length: m}, (_, i) => A[i][j]));
}

Exercise 1.4: Linear Backward Build

Given y = xW + b and dY (the gradient flowing back), compute the three gradients: dW = x^T·dY, db = sum of dY across the batch, dx = dY·W^T. Return an object {dW, db, dx}.

Show solution

javascript
function linearBackward(x, W, dY) {
  const batch = x.length, dIn = W.length, dOut = W[0].length;
  // dW = x^T * dY
  const dW = Array.from({length: dIn}, (_, i) =>
    Array.from({length: dOut}, (_, j) => {
      let s = 0;
      for (let b = 0; b < batch; b++) s += x[b][i] * dY[b][j];
      return s;
    }));
  // db = column sums of dY
  const db = Array.from({length: dOut}, (_, j) => {
    let s = 0;
    for (let b = 0; b < batch; b++) s += dY[b][j];
    return s;
  });
  // dx = dY * W^T
  const dx = Array.from({length: batch}, (_, b) =>
    Array.from({length: dIn}, (_, i) => {
      let s = 0;
      for (let j = 0; j < dOut; j++) s += dY[b][j] * W[i][j];
      return s;
    }));
  return { dW, db, dx };
}

Exercise 1.5: Shape Tracking Trace

x is [4, 512] and W is [512, 2048]. What is the shape of y = xW, and how many multiply-accumulate operations does it take?

y is [4, 2048], with 4 × 512 = 2,048 MACs y is [512, 512], with 512³ MACs y is [4, 2048], with 4 × 2048 × 512 = 4,194,304 MACs y is [2048, 4], with 2048 × 4 × 512 MACs

Exercise 1.6: ReLU Activation Build

Implement ReLU for a 2D array: max(0, x) element-wise. This goes between the two linear layers in the FFN.

Show solution

javascript
function relu(X) {
  return X.map(row => row.map(x => Math.max(0, x)));
}

Chapter 2: Attention

Attention is the core innovation of the transformer. Given queries Q, keys K, and values V, attention computes a weighted average of values where the weights come from how well each query matches each key. The formula:

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

The √d_k scaling prevents dot products from growing too large as the dimension increases (which would push softmax into saturation, killing gradients).

Step by step. (1) Compute attention scores = Q·K^T giving a [seq, seq] matrix. (2) Scale by 1/√d_k. (3) Apply causal mask (optional — set future positions to −∞). (4) Softmax each row. (5) Multiply by V to get the output.

Exercise 2.1: Dot-Product Attention Build

Implement scaled dot-product attention. Q, K, V are all [seq_len, d_k]. No causal mask for now.

Show solution

javascript
function attention(Q, K, V) {
  const seq = Q.length, dk = Q[0].length;
  const scale = 1 / Math.sqrt(dk);
  // scores = Q * K^T → [seq, seq]
  const KT = Array.from({length: dk}, (_, j) =>
    Array.from({length: seq}, (_, i) => K[i][j]));
  const scores = Array.from({length: seq}, (_, i) =>
    Array.from({length: seq}, (_, j) => {
      let s = 0;
      for (let k = 0; k < dk; k++) s += Q[i][k] * KT[k][j];
      return s * scale;
    }));
  // softmax each row
  const weights = scores.map(row => {
    const max = Math.max(...row);
    const exps = row.map(x => Math.exp(x - max));
    const sum = exps.reduce((a, b) => a + b);
    return exps.map(e => e / sum);
  });
  // output = weights * V → [seq, dk]
  return Array.from({length: seq}, (_, i) =>
    Array.from({length: dk}, (_, j) => {
      let s = 0;
      for (let k = 0; k < seq; k++) s += weights[i][k] * V[k][j];
      return s;
    }));
}

Exercise 2.2: Causal Mask Build

Implement a function that creates a causal (lower-triangular) mask. Position i can only attend to positions 0..i. Future positions should be −Infinity so softmax zeros them out.

Show solution

javascript
function causalMask(seqLen) {
  return Array.from({length: seqLen}, (_, i) =>
    Array.from({length: seqLen}, (_, j) => j <= i ? 0 : -Infinity));
}

Exercise 2.3: Causal Attention Build

Now combine them: implement attention with a causal mask. Add the mask to the scaled scores before softmax.

Show solution

javascript
function causalAttention(Q, K, V) {
  const seq = Q.length, dk = Q[0].length;
  const scale = 1 / Math.sqrt(dk);
  // scores[i][j] = dot(Q[i], K[j]) * scale
  const scores = Array.from({length: seq}, (_, i) =>
    Array.from({length: seq}, (_, j) => {
      let s = 0;
      for (let k = 0; k < dk; k++) s += Q[i][k] * K[j][k];
      return j <= i ? s * scale : -Infinity;
    }));
  const weights = scores.map(row => {
    const max = Math.max(...row);
    const exps = row.map(x => Math.exp(x - max));
    const sum = exps.reduce((a, b) => a + b);
    return exps.map(e => e / sum);
  });
  return Array.from({length: seq}, (_, i) =>
    Array.from({length: dk}, (_, j) => {
      let s = 0;
      for (let k = 0; k < seq; k++) s += weights[i][k] * V[k][j];
      return s;
    }));
}

Exercise 2.4: Multi-Head Split Build

Multi-head attention splits the d_model dimension into h heads of size d_k = d_model/h. Implement splitHeads: reshape [seq, d_model] into [h, seq, d_k].

Show solution

javascript
function splitHeads(X, numHeads) {
  const seq = X.length, dModel = X[0].length;
  const dk = dModel / numHeads;
  return Array.from({length: numHeads}, (_, h) =>
    Array.from({length: seq}, (_, s) =>
      X[s].slice(h * dk, (h + 1) * dk)));
}

Exercise 2.5: Concat Heads Build

The reverse: merge [h, seq, d_k] back into [seq, d_model] by concatenating along the last dimension.

Show solution

javascript
function concatHeads(heads) {
  const seq = heads[0].length;
  return Array.from({length: seq}, (_, s) =>
    heads.flatMap(h => h[s]));
}

Exercise 2.6: Attention Complexity Trace

For a sequence of length n with model dimension d, what is the time complexity of self-attention, and why is it a bottleneck for long sequences?

O(n · d) — linear in sequence length O(n² · d) — the Q·K^T matrix is [n,n] and each entry takes d multiplies O(n²) — just the matrix size, no d factor O(d²) — dominated by the weight matrices

Chapter 3: Layer Normalization

Training deep networks is unstable without normalization. Layer Normalization normalizes across the feature dimension (each token independently), unlike Batch Norm which normalizes across the batch. For each token's feature vector, compute the mean and variance, normalize to zero mean and unit variance, then scale and shift with learnable parameters γ and β.

LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β
where μ = mean(x), σ² = var(x) across the feature dimension

RMSNorm is a simpler variant used in LLaMA and many modern models. It skips the mean subtraction and just divides by the root-mean-square:

RMSNorm(x) = γ · x / √(mean(x²) + ε)

Why RMSNorm over LayerNorm? RMSNorm is cheaper (no mean computation, no bias) and empirically works just as well. Since the mean subtraction in LayerNorm is redundant when followed by a linear layer (which adds its own bias), RMSNorm removes unnecessary computation.

Exercise 3.1: Layer Norm Build

Implement LayerNorm for a 1D vector. Input x is [d], gamma is [d], beta is [d], eps is a small constant.

Show solution

javascript
function layerNorm(x, gamma, beta, eps) {
  const d = x.length;
  const mean = x.reduce((a, b) => a + b) / d;
  const variance = x.reduce((a, v) => a + (v - mean) ** 2, 0) / d;
  return x.map((xi, i) =>
    gamma[i] * (xi - mean) / Math.sqrt(variance + eps) + beta[i]);
}

Exercise 3.2: RMS Norm Build

Implement RMSNorm. Simpler than LayerNorm: no mean subtraction, no beta. Just divide by the root-mean-square and scale by gamma.

Show solution

javascript
function rmsNorm(x, gamma, eps) {
  const d = x.length;
  const rms = Math.sqrt(x.reduce((a, v) => a + v * v, 0) / d + eps);
  return x.map((xi, i) => gamma[i] * xi / rms);
}

Exercise 3.3: Batched Layer Norm Build

Now handle a batch of tokens. Input is [batch, d]. Apply LayerNorm independently to each row.

Show solution

javascript
function batchLayerNorm(X, gamma, beta, eps) {
  return X.map(x => {
    const d = x.length;
    const mean = x.reduce((a, b) => a + b) / d;
    const vari = x.reduce((a, v) => a + (v - mean) ** 2, 0) / d;
    return x.map((xi, i) =>
      gamma[i] * (xi - mean) / Math.sqrt(vari + eps) + beta[i]);
  });
}

Exercise 3.4: Wrong Axis Bug Debug

This layerNorm computes mean and variance incorrectly. The output has wrong magnitudes. Click the buggy line.

function layerNorm(x, gamma, beta, eps) {
  const d = x.length;
  const mean = x.reduce((a, b) => a + b) / d;
  const variance = x.reduce((a, v) => a + (v - mean) ** 2, 0);
  return x.map((xi, i) =>
    gamma[i] * (xi - mean) / Math.sqrt(variance + eps) + beta[i]);
}

Show explanation

Line 4 is the bug. The variance computation doesn't divide by d. It computes the sum of squared deviations instead of the mean of squared deviations. It should be x.reduce((a, v) => a + (v - mean) ** 2, 0) / d. Without dividing by d, the normalization factor is √d too large, making the output values √d too small.

Exercise 3.5: Pre-Norm vs Post-Norm Trace

In the original Transformer (Vaswani 2017), LayerNorm is applied after the residual addition: x + LN(Attn(x)). Modern transformers use pre-norm: x + Attn(LN(x)). Why is pre-norm preferred?

Pre-norm produces better final accuracy Pre-norm uses fewer parameters Pre-norm stabilizes training — gradients flow through the residual unchanged, so deep networks train without warmup tricks Pre-norm is faster to compute

Chapter 4: Positional Encoding

Attention is permutation-invariant — it treats "the cat sat" and "sat the cat" identically. We need to inject position information. The original Transformer uses sinusoidal positional encodings: fixed patterns of sines and cosines at different frequencies.

PE(pos, 2i) = sin(pos / 10000^2i/d)
PE(pos, 2i+1) = cos(pos / 10000^2i/d)

Each dimension uses a different frequency. Low dimensions oscillate quickly (position-specific), high dimensions oscillate slowly (capturing long-range patterns). The key property: the encoding at position pos+k can be expressed as a linear function of the encoding at pos, making relative positions learnable.

Even dimensions use sin, odd dimensions use cos. For dimension index i, the frequency is 1/10000^2⌊i/2⌋/d. This gives a geometric progression of wavelengths from 2π (dimension 0) to 10000·2π (last dimension).

Exercise 4.1: Sinusoidal PE Build

Generate the full positional encoding table. Return a [seqLen, dModel] array where position pos and dimension i uses sin for even i and cos for odd i.

Show solution

javascript
function sinusoidalPE(seqLen, dModel) {
  const PE = [];
  for (let pos = 0; pos < seqLen; pos++) {
    const row = [];
    for (let i = 0; i < dModel; i++) {
      const freq = 1 / Math.pow(10000, 2 * Math.floor(i / 2) / dModel);
      row.push(i % 2 === 0 ? Math.sin(pos * freq) : Math.cos(pos * freq));
    }
    PE.push(row);
  }
  return PE;
}

Exercise 4.2: Dot Product Distance Build

A key property of sinusoidal PE: the dot product between PE[pos] and PE[pos+k] depends only on k, not on pos. Implement a function that computes the dot product of two position encodings.

Show solution

javascript
function peDotProduct(pos1, pos2, dModel) {
  let dot = 0;
  for (let i = 0; i < dModel; i++) {
    const freq = 1 / Math.pow(10000, 2 * Math.floor(i / 2) / dModel);
    const a = i % 2 === 0 ? Math.sin(pos1 * freq) : Math.cos(pos1 * freq);
    const b = i % 2 === 0 ? Math.sin(pos2 * freq) : Math.cos(pos2 * freq);
    dot += a * b;
  }
  return dot;
}

Exercise 4.3: Add PE to Embeddings Build

In practice, positional encodings are added to token embeddings element-wise. Implement this: given embeddings [seq, d] and PE [seq, d], return their element-wise sum.

Show solution

javascript
function addPE(embeddings, PE) {
  return embeddings.map((row, i) =>
    row.map((v, j) => v + PE[i][j]));
}

Exercise 4.4: Why Sinusoids? Trace

Why does the original Transformer use sinusoidal positional encodings instead of just using the position number (0, 1, 2, ...)?

Sinusoids are faster to compute Position numbers would require normalization Sinusoids are bounded [−1,1], encode relative distances via dot products, and generalize to unseen sequence lengths Sinusoids use less memory

Chapter 5: Tokenizer (BPE)

Before any tensor math, text must become numbers. Byte-Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent adjacent pair into a new token. After training, you have a merge table (ordered list of merges) that determines how to tokenize new text.

Encoding with a trained BPE tokenizer: start with individual characters, then apply merges in priority order. Each merge says "whenever you see tokens A and B adjacent, combine them into AB."

BPE is greedy and order-dependent. You must apply merges in the exact order they were learned during training. Merge #1 has highest priority. If merge #3 creates a pair that matches merge #1, you do NOT go back — you already applied merge #1.

Exercise 5.1: Apply BPE Merges Build

Given a list of character tokens and an ordered merge table, apply all applicable merges. Each merge is [tokenA, tokenB] meaning "replace adjacent A,B with A+B".

Show solution

javascript
function bpeEncode(chars, merges) {
  let tokens = [...chars];
  for (const [a, b] of merges) {
    let i = 0;
    while (i < tokens.length - 1) {
      if (tokens[i] === a && tokens[i + 1] === b) {
        tokens.splice(i, 2, a + b);
      } else {
        i++;
      }
    }
  }
  return tokens;
}

Exercise 5.2: BPE Decode Build

Decoding BPE is simple: given a list of token IDs and a vocabulary (id → string), concatenate the strings. Implement decode.

Show solution

javascript
function bpeDecode(tokenIds, vocab) {
  return tokenIds.map(id => vocab[String(id)]).join('');
}

Exercise 5.3: Find Most Frequent Pair Build

During BPE training, you repeatedly find the most frequent adjacent pair of tokens. Implement this: return the pair [tokenA, tokenB] that appears most often adjacently.

Show solution

javascript
function getMostFrequentPair(tokens) {
  if (tokens.length < 2) return null;
  const counts = {};
  for (let i = 0; i < tokens.length - 1; i++) {
    const key = tokens[i] + '\0' + tokens[i + 1];
    counts[key] = (counts[key] || 0) + 1;
  }
  let best = null, bestCount = 0;
  for (const [key, count] of Object.entries(counts)) {
    if (count > bestCount) {
      bestCount = count;
      best = key.split('\0');
    }
  }
  return best;
}

Exercise 5.4: BPE Vocabulary Size Trace

You start with 256 byte-level tokens and perform 49,744 BPE merges (like GPT-2). How many tokens are in the final vocabulary?

49,744 — only the merged tokens 50,000 — original 256 bytes + 49,744 new tokens from merges 256 — merges replace tokens, they don't add new ones It depends on the training corpus

Exercise 5.5: Token-to-ID Mapping Build

Given a vocabulary (array of token strings, where index = ID), convert a list of token strings to their IDs.

Show solution

javascript
function tokensToIds(tokens, vocab) {
  return tokens.map(t => {
    const idx = vocab.indexOf(t);
    return idx >= 0 ? idx : -1;
  });
}

Chapter 6: Cross-Entropy Loss

The loss function tells the model how wrong it is. For language modeling, we use cross-entropy loss: given logits over the vocabulary and the correct token index, compute −log(probability of the correct token). Lower is better. Zero means perfect confidence in the right answer.

L = −log(softmax(logits)_target) = −logSoftmax(logits)_target

For a batch of predictions, average the per-sample losses. The gradient of cross-entropy with respect to the logits has a beautifully simple form: softmax(logits) − one_hot(target). This is what flows backward through the network.

Why log-softmax? Computing −log(softmax(x)) directly is numerically unstable: softmax can underflow to 0, and log(0) = −∞. Using the log-softmax identity log_softmax(x_i) = x_i − max − log(∑ exp(x_j − max)) avoids both overflow and underflow.

Exercise 6.1: Cross-Entropy (Single Sample) Build

Compute cross-entropy loss for a single prediction. logits is [vocab_size], target is an integer index.

Show solution

javascript
function crossEntropy(logits, target) {
  const max = Math.max(...logits);
  const shifted = logits.map(x => x - max);
  const logSumExp = Math.log(shifted.reduce((a, x) => a + Math.exp(x), 0));
  return -(shifted[target] - logSumExp);
}

Exercise 6.2: Batch Cross-Entropy Build

Compute average cross-entropy over a batch. logits is [batch, vocab], targets is [batch] of indices.

Show solution

javascript
function batchCrossEntropy(logits, targets) {
  let total = 0;
  for (let i = 0; i < logits.length; i++) {
    const row = logits[i], t = targets[i];
    const max = Math.max(...row);
    const shifted = row.map(x => x - max);
    const lse = Math.log(shifted.reduce((a, x) => a + Math.exp(x), 0));
    total += -(shifted[t] - lse);
  }
  return total / logits.length;
}

Exercise 6.3: Cross-Entropy Gradient Build

The gradient of cross-entropy w.r.t. the logits is: dL/dlogits = softmax(logits) − one_hot(target). Implement this for a single sample.

Show solution

javascript
function crossEntropyGrad(logits, target) {
  const max = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b);
  return exps.map((e, i) => e / sum - (i === target ? 1 : 0));
}

Exercise 6.4: Perplexity Build

Perplexity = exp(average cross-entropy loss). It measures how "confused" the model is. A perplexity of V means the model is as uncertain as picking randomly from V choices.

Show solution

javascript
function perplexity(logits, targets) {
  let total = 0;
  for (let i = 0; i < logits.length; i++) {
    const row = logits[i], t = targets[i];
    const max = Math.max(...row);
    const s = row.map(x => x - max);
    total += -(s[t] - Math.log(s.reduce((a, x) => a + Math.exp(x), 0)));
  }
  return Math.exp(total / logits.length);
}

Exercise 6.5: Gradient Sanity Check Trace

The cross-entropy gradient is softmax(logits) − one_hot(target). At the target position, the gradient is softmax[target] − 1, which is always negative. What does this negative gradient mean?

The model is overconfident The loss is decreasing Increasing the target logit will decrease the loss — the gradient says "push this logit higher" The model should decrease the target probability

Chapter 7: Adam Optimizer

SGD with a fixed learning rate is fragile. Adam (Adaptive Moment Estimation) maintains per-parameter running averages of the gradient (first moment, m) and the squared gradient (second moment, v). It adapts the learning rate for each parameter: parameters with noisy gradients get smaller steps, stable gradients get larger steps.

m_t = β₁ · m_t−1 + (1 − β₁) · g_t
v_t = β₂ · v_t−1 + (1 − β₂) · g_t²
m̂_t = m_t / (1 − β₁^t) // bias correction
v̂_t = v_t / (1 − β₂^t) // bias correction
θ_t = θ_t−1 − lr · m̂_t / (√v̂_t + ε)

Bias correction matters at the start. m and v are initialized to zero. Without bias correction, early estimates are biased toward zero (especially v with β₂=0.999 — it takes ~1000 steps to converge). The correction factors 1/(1−β^t) compensate for this cold start.

Exercise 7.1: Adam Step Build

Implement one step of Adam. params, grads, m, v are all 1D arrays of the same length. t is the step number (1-indexed). Return updated {params, m, v}.

Show solution

javascript
function adamStep(params, grads, m, v, t, lr, beta1, beta2, eps) {
  const n = params.length;
  const newM = [], newV = [], newP = [];
  for (let i = 0; i < n; i++) {
    newM[i] = beta1 * m[i] + (1 - beta1) * grads[i];
    newV[i] = beta2 * v[i] + (1 - beta2) * grads[i] * grads[i];
    const mHat = newM[i] / (1 - Math.pow(beta1, t));
    const vHat = newV[i] / (1 - Math.pow(beta2, t));
    newP[i] = params[i] - lr * mHat / (Math.sqrt(vHat) + eps);
  }
  return { params: newP, m: newM, v: newV };
}

Exercise 7.2: Minimize x² Build

Use your Adam implementation to minimize f(x) = x² starting from x=5. Run 100 steps. The gradient of x² is 2x. Return the final value of x (should be near 0).

Show solution

javascript
function minimizeQuadratic(x0, numSteps) {
  let p = [x0], m = [0], v = [0];
  for (let t = 1; t <= numSteps; t++) {
    const g = [2 * p[0]]; // gradient of x^2
    const b1 = 0.9, b2 = 0.999, eps = 1e-8, lr = 0.1;
    m = [b1 * m[0] + (1 - b1) * g[0]];
    v = [b2 * v[0] + (1 - b2) * g[0] * g[0]];
    const mH = m[0] / (1 - Math.pow(b1, t));
    const vH = v[0] / (1 - Math.pow(b2, t));
    p = [p[0] - lr * mH / (Math.sqrt(vH) + eps)];
  }
  return p[0];
}

Exercise 7.3: SGD for Comparison Build

Implement vanilla SGD step for comparison. Much simpler: θ = θ − lr · grad.

Show solution

javascript
function sgdStep(params, grads, lr) {
  return params.map((p, i) => p - lr * grads[i]);
}

Exercise 7.4: Why Bias Correction? Trace

At step t=1 with β₁=0.9, the raw first moment is m₁ = 0.1 · g₁. The bias-corrected version is m̂₁ = m₁ / (1 − 0.9¹) = m₁ / 0.1 = g₁. What is the correction factor doing?

Making the learning rate larger at the start Compensating for the zero initialization — m was initialized to 0, so early estimates are biased toward 0. The correction recovers the true gradient magnitude Normalizing the gradient to unit length Preventing the second moment from dominating

Exercise 7.5: Find the Bug Debug

This Adam implementation is producing NaN after a few steps. Click the buggy line.

function adamStep(p, g, m, v, t, lr, b1, b2, eps) {
  const newM = b1 * m + (1 - b1) * g;
  const newV = b2 * v + (1 - b2) * g * g;
  const mHat = newM / (1 - Math.pow(b1, t));
  const vHat = newV / (1 - Math.pow(b2, t));
  const newP = p - lr * mHat / Math.sqrt(vHat);
  return { p: newP, m: newM, v: newV };
}

Show explanation

Line 6 is the bug. The epsilon is missing from the denominator. It should be Math.sqrt(vHat) + eps. Without epsilon, when vHat is very small (near zero), you divide by a near-zero number, producing huge values that eventually become NaN. The epsilon (typically 1e-8) prevents division by zero.

Chapter 8: KV Cache

During autoregressive generation, we produce one token at a time. Naively, each new token requires recomputing attention over the entire sequence. But the K and V projections of previous tokens don't change — only the new token's Q, K, V are new. A KV cache stores previously computed K and V, so each generation step only computes attention for the new query against all cached keys and values.

Without cache: O(n² · d) per token — recompute everything
With cache: O(n · d) per token — one new query against n cached keys

The speedup is dramatic. For a 2048-token sequence, without cache you do 2048² = 4M operations of attention per layer per step. With cache, you do 2048 operations. That's a 2048x reduction. The trade-off: you use O(n · d) memory to store the cache.

Exercise 8.1: KV Cache Object Build

Implement a KV cache with append(newK, newV) and get() methods. newK and newV are 1D arrays (one token's key/value vectors). get() returns {keys, values} as 2D arrays.

Show solution

javascript
function createKVCache() {
  const keys = [], values = [];
  return {
    append(newK, newV) { keys.push([...newK]); values.push([...newV]); },
    get() { return { keys: keys.map(k => [...k]), values: values.map(v => [...v]) }; },
    get length() { return keys.length; },
  };
}

Exercise 8.2: Cached Attention Build

Implement single-query attention against a KV cache. q is [d_k] (one query vector), keys is [n, d_k], values is [n, d_k]. Return the attention output [d_k].

Show solution

javascript
function cachedAttention(q, keys, values) {
  const dk = q.length, n = keys.length;
  const scale = 1 / Math.sqrt(dk);
  // scores
  const scores = keys.map(k => {
    let s = 0;
    for (let i = 0; i < dk; i++) s += q[i] * k[i];
    return s * scale;
  });
  // softmax
  const max = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b);
  const weights = exps.map(e => e / sum);
  // weighted sum of values
  return Array.from({length: dk}, (_, i) => {
    let s = 0;
    for (let j = 0; j < n; j++) s += weights[j] * values[j][i];
    return s;
  });
}

Exercise 8.3: KV Cache Memory Derive

LLaMA 7B: d=4096, 32 layers, 32 heads (d_k = 128), FP16. For a sequence of 2048 tokens, how much memory does the KV cache require?

Per layer: 2 (K and V) × seq_len × d × 2 bytes. Total = 2 × seq × d × 2 × num_layers.

Show derivation

Per layer = 2 × 2048 × 4096 × 2 bytes = 33,554,432 bytes = 32 MB

All 32 layers = 32 × 32 MB = 1024 MB = 1 GB

For a 7B model, the KV cache for a single 2048-token sequence takes 1 GB of GPU memory. This is why long-context inference is memory-bound. At 128K context, this would be 64 GB — more than the model weights.

Exercise 8.4: Generation Loop Build

Implement a simple autoregressive generation loop using a KV cache. Given a "model" function that returns (logit, newK, newV) for each token, generate numTokens by always picking the highest-logit token (greedy).

Show solution

javascript
function generate(modelFn, startToken, numTokens) {
  const keys = [], values = [], output = [];
  let token = startToken;
  for (let i = 0; i < numTokens; i++) {
    const { logit, newK, newV } = modelFn(token, keys, values);
    keys.push(newK);
    values.push(newV);
    token = logit;
    output.push(token);
  }
  return output;
}

Exercise 8.5: Cache vs No-Cache Complexity Trace

You generate 1000 tokens without a KV cache. Each step recomputes attention for the full sequence. How many total attention operations (proportional to ∑ i²) compared to with cache (∑ i)?

About 2x more without cache About 10x more without cache About 667x more — ∑i² ≈ n³/3 vs ∑i ≈ n²/2, ratio ≈ 2n/3 ≈ 667 The same — the cache just saves memory, not compute

Chapter 9: Top-k and Top-p Sampling

Greedy decoding (always pick the highest probability token) produces repetitive, boring text. Sampling adds randomness. But sampling from the full distribution gives incoherent results — rare tokens like "banana" might follow "The president signed the". Top-k and top-p (nucleus) sampling restrict the sampling pool to only plausible tokens.

Top-k keeps the k highest-probability tokens and zeros the rest. Top-p keeps the smallest set of tokens whose cumulative probability exceeds p. Top-p is adaptive: for high-confidence predictions it might keep only 2-3 tokens, while for uncertain predictions it keeps many.

Exercise 9.1: Temperature Scaling Build

Implement temperature-scaled sampling. Divide logits by T, apply softmax, then sample from the distribution. Return the sampled index. For deterministic testing, use the provided random number r ∈ [0,1) instead of Math.random().

Show solution

javascript
function temperatureSample(logits, T, r) {
  const scaled = logits.map(x => x / T);
  const max = Math.max(...scaled);
  const exps = scaled.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b);
  const probs = exps.map(e => e / sum);
  let cum = 0;
  for (let i = 0; i < probs.length; i++) {
    cum += probs[i];
    if (cum > r) return i;
  }
  return probs.length - 1;
}

Exercise 9.2: Top-k Filtering Build

Implement top-k filtering: keep the k highest-probability tokens, set the rest to −Infinity, then apply softmax. Return the filtered probability distribution.

Show solution

javascript
function topKFilter(logits, k) {
  const sorted = [...logits].sort((a, b) => b - a);
  const threshold = sorted[k - 1];
  const filtered = logits.map(x => x >= threshold ? x : -Infinity);
  const max = Math.max(...filtered);
  const exps = filtered.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b);
  return exps.map(e => e / sum);
}

Exercise 9.3: Top-p (Nucleus) Filtering Build

Implement top-p sampling: sort tokens by probability, include tokens until cumulative probability exceeds p, zero the rest. Return the renormalized distribution.

Show solution

javascript
function topPFilter(logits, p) {
  // softmax first
  const max = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b);
  const probs = exps.map(e => e / sum);
  // sort indices by probability descending
  const indices = probs.map((v, i) => i).sort((a, b) => probs[b] - probs[a]);
  let cum = 0;
  const keep = new Set();
  for (const idx of indices) {
    keep.add(idx);
    cum += probs[idx];
    if (cum > p) break;
  }
  const filtered = probs.map((v, i) => keep.has(i) ? v : 0);
  const fSum = filtered.reduce((a, b) => a + b);
  return filtered.map(v => v / fSum);
}

Exercise 9.4: Repetition Penalty Build

Implement repetition penalty: divide the logits of previously-generated tokens by a penalty factor. This discourages the model from repeating itself.

Show solution

javascript
function applyRepetitionPenalty(logits, generatedTokens, penalty) {
  const result = [...logits];
  const seen = new Set(generatedTokens);
  for (let i = 0; i < result.length; i++) {
    if (seen.has(i)) result[i] /= penalty;
  }
  return result;
}

Exercise 9.5: Top-k vs Top-p Trace

Consider a model predicting the next word after "The capital of France is". The top 2 tokens are "Paris" (p=0.95) and "the" (p=0.02). With top-k=5, you sample from 5 tokens. With top-p=0.95, how many tokens do you sample from?

5 tokens, same as top-k Just 1 token ("Paris" alone exceeds p=0.95) — top-p is adaptive and narrows the pool when the model is confident 2 tokens (Paris + the) All tokens in the vocabulary

Exercise 9.6: Full Sampling Pipeline Build

Combine everything: apply temperature, then top-k filtering, then sample. Return the sampled index.

Show solution

javascript
function sampleWithTopK(logits, temperature, k, r) {
  const scaled = logits.map(x => x / temperature);
  // top-k filter
  const sorted = [...scaled].sort((a, b) => b - a);
  const threshold = sorted[k - 1];
  const filtered = scaled.map(x => x >= threshold ? x : -Infinity);
  // softmax
  const max = Math.max(...filtered);
  const exps = filtered.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b);
  const probs = exps.map(e => e / sum);
  // sample
  let cum = 0;
  for (let i = 0; i < probs.length; i++) {
    cum += probs[i];
    if (cum > r) return i;
  }
  return probs.length - 1;
}