What I cannot create, I do not understand. Every core building block of a transformer — softmax, attention, normalization, optimization, sampling — implemented in plain JavaScript with instant test feedback.
Every transformer starts here. You have a vector of raw scores — logits — and you need to turn them into a probability distribution. That's softmax: exponentiate each value, then divide by the sum. Sounds trivial. Until your logits are [1000, 1001, 999] and Math.exp(1000) returns Infinity.
The trick is numerical stability: subtract the max before exponentiating. The math is identical (it factors out as a constant in numerator and denominator), but the largest exponent becomes e0 = 1 instead of e1000 = boom.
Write a numerically stable softmax. Input: array of logits. Output: array of probabilities that sum to 1.
javascript function softmax(logits) { const max = Math.max(...logits); const exps = logits.map(x => Math.exp(x - max)); const sum = exps.reduce((a, b) => a + b, 0); return exps.map(e => e / sum); }
Log-softmax is used everywhere in loss computation because log(softmax(x)) computed naively loses precision. Instead, compute it directly: log_softmax(xi) = (xi − max) − log(∑ exp(xj − max)).
javascript function logSoftmax(logits) { const max = Math.max(...logits); const shifted = logits.map(x => x - max); const logSumExp = Math.log(shifted.reduce((a, x) => a + Math.exp(x), 0)); return shifted.map(x => x - logSumExp); }
This softmax implementation has a subtle bug. It works for small values but overflows for large logits. Click the buggy line.
function softmax(logits) { const exps = logits.map(x => Math.exp(x)); const sum = exps.reduce((a, b) => a + b, 0); return exps.map(e => e / sum); }
Line 2 is the bug. It exponentiates raw logits without subtracting the max first. For logits like [1000, 1001, 999], Math.exp(1000) returns Infinity. The fix: compute const max = Math.max(...logits) and use Math.exp(x - max).
Implement softmax with temperature: divide logits by T before applying softmax. T>1 makes the distribution more uniform (softer). T<1 makes it peakier (sharper). T→0 approaches argmax.
javascript function softmaxTemp(logits, T) { const scaled = logits.map(x => x / T); const max = Math.max(...scaled); const exps = scaled.map(x => Math.exp(x - max)); const sum = exps.reduce((a, b) => a + b, 0); return exps.map(e => e / sum); }
The linear layer is the workhorse of neural networks. It's just a matrix multiply plus a bias: y = xW + b. Every projection in a transformer — Q, K, V, output, FFN up, FFN down — is a linear layer. If you can implement matmul, you can implement everything else.
We work with 2D arrays where x[i][j] means row i, column j. Matrix multiply of A [m×n] and B [n×p] produces C [m×p] where C[i][j] = ∑k A[i][k] · B[k][j].
Implement general matrix multiplication. A is [m, n], B is [n, p]. Return C = A · B which is [m, p].
javascript function matmul(A, B) { const m = A.length, n = A[0].length, p = B[0].length; const C = Array.from({length: m}, () => Array(p).fill(0)); for (let i = 0; i < m; i++) for (let j = 0; j < p; j++) for (let k = 0; k < n; k++) C[i][j] += A[i][k] * B[k][j]; return C; }
Implement y = xW + b. x is [batch, d_in], W is [d_in, d_out], b is [d_out]. Output is [batch, d_out]. You can call your matmul from 1.1 inside this function.
javascript function linear(x, W, b) { const batch = x.length, dOut = W[0].length, dIn = W.length; const out = Array.from({length: batch}, () => Array(dOut).fill(0)); for (let i = 0; i < batch; i++) for (let j = 0; j < dOut; j++) { let s = b[j]; for (let k = 0; k < dIn; k++) s += x[i][k] * W[k][j]; out[i][j] = s; } return out; }
Implement matrix transpose. You'll need this for backprop and attention. Input [m, n], output [n, m].
javascript function transpose(A) { const m = A.length, n = A[0].length; return Array.from({length: n}, (_, j) => Array.from({length: m}, (_, i) => A[i][j])); }
Given y = xW + b and dY (the gradient flowing back), compute the three gradients: dW = xT·dY, db = sum of dY across the batch, dx = dY·WT. Return an object {dW, db, dx}.
javascript function linearBackward(x, W, dY) { const batch = x.length, dIn = W.length, dOut = W[0].length; // dW = x^T * dY const dW = Array.from({length: dIn}, (_, i) => Array.from({length: dOut}, (_, j) => { let s = 0; for (let b = 0; b < batch; b++) s += x[b][i] * dY[b][j]; return s; })); // db = column sums of dY const db = Array.from({length: dOut}, (_, j) => { let s = 0; for (let b = 0; b < batch; b++) s += dY[b][j]; return s; }); // dx = dY * W^T const dx = Array.from({length: batch}, (_, b) => Array.from({length: dIn}, (_, i) => { let s = 0; for (let j = 0; j < dOut; j++) s += dY[b][j] * W[i][j]; return s; })); return { dW, db, dx }; }
Implement ReLU for a 2D array: max(0, x) element-wise. This goes between the two linear layers in the FFN.
javascript function relu(X) { return X.map(row => row.map(x => Math.max(0, x))); }
Attention is the core innovation of the transformer. Given queries Q, keys K, and values V, attention computes a weighted average of values where the weights come from how well each query matches each key. The formula:
The √dk scaling prevents dot products from growing too large as the dimension increases (which would push softmax into saturation, killing gradients).
Implement scaled dot-product attention. Q, K, V are all [seq_len, d_k]. No causal mask for now.
javascript function attention(Q, K, V) { const seq = Q.length, dk = Q[0].length; const scale = 1 / Math.sqrt(dk); // scores = Q * K^T → [seq, seq] const KT = Array.from({length: dk}, (_, j) => Array.from({length: seq}, (_, i) => K[i][j])); const scores = Array.from({length: seq}, (_, i) => Array.from({length: seq}, (_, j) => { let s = 0; for (let k = 0; k < dk; k++) s += Q[i][k] * KT[k][j]; return s * scale; })); // softmax each row const weights = scores.map(row => { const max = Math.max(...row); const exps = row.map(x => Math.exp(x - max)); const sum = exps.reduce((a, b) => a + b); return exps.map(e => e / sum); }); // output = weights * V → [seq, dk] return Array.from({length: seq}, (_, i) => Array.from({length: dk}, (_, j) => { let s = 0; for (let k = 0; k < seq; k++) s += weights[i][k] * V[k][j]; return s; })); }
Implement a function that creates a causal (lower-triangular) mask. Position i can only attend to positions 0..i. Future positions should be −Infinity so softmax zeros them out.
javascript function causalMask(seqLen) { return Array.from({length: seqLen}, (_, i) => Array.from({length: seqLen}, (_, j) => j <= i ? 0 : -Infinity)); }
Now combine them: implement attention with a causal mask. Add the mask to the scaled scores before softmax.
javascript function causalAttention(Q, K, V) { const seq = Q.length, dk = Q[0].length; const scale = 1 / Math.sqrt(dk); // scores[i][j] = dot(Q[i], K[j]) * scale const scores = Array.from({length: seq}, (_, i) => Array.from({length: seq}, (_, j) => { let s = 0; for (let k = 0; k < dk; k++) s += Q[i][k] * K[j][k]; return j <= i ? s * scale : -Infinity; })); const weights = scores.map(row => { const max = Math.max(...row); const exps = row.map(x => Math.exp(x - max)); const sum = exps.reduce((a, b) => a + b); return exps.map(e => e / sum); }); return Array.from({length: seq}, (_, i) => Array.from({length: dk}, (_, j) => { let s = 0; for (let k = 0; k < seq; k++) s += weights[i][k] * V[k][j]; return s; })); }
Multi-head attention splits the d_model dimension into h heads of size d_k = d_model/h. Implement splitHeads: reshape [seq, d_model] into [h, seq, d_k].
javascript function splitHeads(X, numHeads) { const seq = X.length, dModel = X[0].length; const dk = dModel / numHeads; return Array.from({length: numHeads}, (_, h) => Array.from({length: seq}, (_, s) => X[s].slice(h * dk, (h + 1) * dk))); }
The reverse: merge [h, seq, d_k] back into [seq, d_model] by concatenating along the last dimension.
javascript function concatHeads(heads) { const seq = heads[0].length; return Array.from({length: seq}, (_, s) => heads.flatMap(h => h[s])); }
Training deep networks is unstable without normalization. Layer Normalization normalizes across the feature dimension (each token independently), unlike Batch Norm which normalizes across the batch. For each token's feature vector, compute the mean and variance, normalize to zero mean and unit variance, then scale and shift with learnable parameters γ and β.
RMSNorm is a simpler variant used in LLaMA and many modern models. It skips the mean subtraction and just divides by the root-mean-square:
Implement LayerNorm for a 1D vector. Input x is [d], gamma is [d], beta is [d], eps is a small constant.
javascript function layerNorm(x, gamma, beta, eps) { const d = x.length; const mean = x.reduce((a, b) => a + b) / d; const variance = x.reduce((a, v) => a + (v - mean) ** 2, 0) / d; return x.map((xi, i) => gamma[i] * (xi - mean) / Math.sqrt(variance + eps) + beta[i]); }
Implement RMSNorm. Simpler than LayerNorm: no mean subtraction, no beta. Just divide by the root-mean-square and scale by gamma.
javascript function rmsNorm(x, gamma, eps) { const d = x.length; const rms = Math.sqrt(x.reduce((a, v) => a + v * v, 0) / d + eps); return x.map((xi, i) => gamma[i] * xi / rms); }
Now handle a batch of tokens. Input is [batch, d]. Apply LayerNorm independently to each row.
javascript function batchLayerNorm(X, gamma, beta, eps) { return X.map(x => { const d = x.length; const mean = x.reduce((a, b) => a + b) / d; const vari = x.reduce((a, v) => a + (v - mean) ** 2, 0) / d; return x.map((xi, i) => gamma[i] * (xi - mean) / Math.sqrt(vari + eps) + beta[i]); }); }
This layerNorm computes mean and variance incorrectly. The output has wrong magnitudes. Click the buggy line.
function layerNorm(x, gamma, beta, eps) { const d = x.length; const mean = x.reduce((a, b) => a + b) / d; const variance = x.reduce((a, v) => a + (v - mean) ** 2, 0); return x.map((xi, i) => gamma[i] * (xi - mean) / Math.sqrt(variance + eps) + beta[i]); }
Line 4 is the bug. The variance computation doesn't divide by d. It computes the sum of squared deviations instead of the mean of squared deviations. It should be x.reduce((a, v) => a + (v - mean) ** 2, 0) / d. Without dividing by d, the normalization factor is √d too large, making the output values √d too small.
x + LN(Attn(x)). Modern transformers use pre-norm: x + Attn(LN(x)). Why is pre-norm preferred?Attention is permutation-invariant — it treats "the cat sat" and "sat the cat" identically. We need to inject position information. The original Transformer uses sinusoidal positional encodings: fixed patterns of sines and cosines at different frequencies.
Each dimension uses a different frequency. Low dimensions oscillate quickly (position-specific), high dimensions oscillate slowly (capturing long-range patterns). The key property: the encoding at position pos+k can be expressed as a linear function of the encoding at pos, making relative positions learnable.
Generate the full positional encoding table. Return a [seqLen, dModel] array where position pos and dimension i uses sin for even i and cos for odd i.
javascript function sinusoidalPE(seqLen, dModel) { const PE = []; for (let pos = 0; pos < seqLen; pos++) { const row = []; for (let i = 0; i < dModel; i++) { const freq = 1 / Math.pow(10000, 2 * Math.floor(i / 2) / dModel); row.push(i % 2 === 0 ? Math.sin(pos * freq) : Math.cos(pos * freq)); } PE.push(row); } return PE; }
A key property of sinusoidal PE: the dot product between PE[pos] and PE[pos+k] depends only on k, not on pos. Implement a function that computes the dot product of two position encodings.
javascript function peDotProduct(pos1, pos2, dModel) { let dot = 0; for (let i = 0; i < dModel; i++) { const freq = 1 / Math.pow(10000, 2 * Math.floor(i / 2) / dModel); const a = i % 2 === 0 ? Math.sin(pos1 * freq) : Math.cos(pos1 * freq); const b = i % 2 === 0 ? Math.sin(pos2 * freq) : Math.cos(pos2 * freq); dot += a * b; } return dot; }
In practice, positional encodings are added to token embeddings element-wise. Implement this: given embeddings [seq, d] and PE [seq, d], return their element-wise sum.
javascript function addPE(embeddings, PE) { return embeddings.map((row, i) => row.map((v, j) => v + PE[i][j])); }
Before any tensor math, text must become numbers. Byte-Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent adjacent pair into a new token. After training, you have a merge table (ordered list of merges) that determines how to tokenize new text.
Encoding with a trained BPE tokenizer: start with individual characters, then apply merges in priority order. Each merge says "whenever you see tokens A and B adjacent, combine them into AB."
Given a list of character tokens and an ordered merge table, apply all applicable merges. Each merge is [tokenA, tokenB] meaning "replace adjacent A,B with A+B".
javascript function bpeEncode(chars, merges) { let tokens = [...chars]; for (const [a, b] of merges) { let i = 0; while (i < tokens.length - 1) { if (tokens[i] === a && tokens[i + 1] === b) { tokens.splice(i, 2, a + b); } else { i++; } } } return tokens; }
Decoding BPE is simple: given a list of token IDs and a vocabulary (id → string), concatenate the strings. Implement decode.
javascript function bpeDecode(tokenIds, vocab) { return tokenIds.map(id => vocab[String(id)]).join(''); }
During BPE training, you repeatedly find the most frequent adjacent pair of tokens. Implement this: return the pair [tokenA, tokenB] that appears most often adjacently.
javascript function getMostFrequentPair(tokens) { if (tokens.length < 2) return null; const counts = {}; for (let i = 0; i < tokens.length - 1; i++) { const key = tokens[i] + '\0' + tokens[i + 1]; counts[key] = (counts[key] || 0) + 1; } let best = null, bestCount = 0; for (const [key, count] of Object.entries(counts)) { if (count > bestCount) { bestCount = count; best = key.split('\0'); } } return best; }
Given a vocabulary (array of token strings, where index = ID), convert a list of token strings to their IDs.
javascript function tokensToIds(tokens, vocab) { return tokens.map(t => { const idx = vocab.indexOf(t); return idx >= 0 ? idx : -1; }); }
The loss function tells the model how wrong it is. For language modeling, we use cross-entropy loss: given logits over the vocabulary and the correct token index, compute −log(probability of the correct token). Lower is better. Zero means perfect confidence in the right answer.
For a batch of predictions, average the per-sample losses. The gradient of cross-entropy with respect to the logits has a beautifully simple form: softmax(logits) − one_hot(target). This is what flows backward through the network.
Compute cross-entropy loss for a single prediction. logits is [vocab_size], target is an integer index.
javascript function crossEntropy(logits, target) { const max = Math.max(...logits); const shifted = logits.map(x => x - max); const logSumExp = Math.log(shifted.reduce((a, x) => a + Math.exp(x), 0)); return -(shifted[target] - logSumExp); }
Compute average cross-entropy over a batch. logits is [batch, vocab], targets is [batch] of indices.
javascript function batchCrossEntropy(logits, targets) { let total = 0; for (let i = 0; i < logits.length; i++) { const row = logits[i], t = targets[i]; const max = Math.max(...row); const shifted = row.map(x => x - max); const lse = Math.log(shifted.reduce((a, x) => a + Math.exp(x), 0)); total += -(shifted[t] - lse); } return total / logits.length; }
The gradient of cross-entropy w.r.t. the logits is: dL/dlogits = softmax(logits) − one_hot(target). Implement this for a single sample.
javascript function crossEntropyGrad(logits, target) { const max = Math.max(...logits); const exps = logits.map(x => Math.exp(x - max)); const sum = exps.reduce((a, b) => a + b); return exps.map((e, i) => e / sum - (i === target ? 1 : 0)); }
Perplexity = exp(average cross-entropy loss). It measures how "confused" the model is. A perplexity of V means the model is as uncertain as picking randomly from V choices.
javascript function perplexity(logits, targets) { let total = 0; for (let i = 0; i < logits.length; i++) { const row = logits[i], t = targets[i]; const max = Math.max(...row); const s = row.map(x => x - max); total += -(s[t] - Math.log(s.reduce((a, x) => a + Math.exp(x), 0))); } return Math.exp(total / logits.length); }
SGD with a fixed learning rate is fragile. Adam (Adaptive Moment Estimation) maintains per-parameter running averages of the gradient (first moment, m) and the squared gradient (second moment, v). It adapts the learning rate for each parameter: parameters with noisy gradients get smaller steps, stable gradients get larger steps.
Implement one step of Adam. params, grads, m, v are all 1D arrays of the same length. t is the step number (1-indexed). Return updated {params, m, v}.
javascript function adamStep(params, grads, m, v, t, lr, beta1, beta2, eps) { const n = params.length; const newM = [], newV = [], newP = []; for (let i = 0; i < n; i++) { newM[i] = beta1 * m[i] + (1 - beta1) * grads[i]; newV[i] = beta2 * v[i] + (1 - beta2) * grads[i] * grads[i]; const mHat = newM[i] / (1 - Math.pow(beta1, t)); const vHat = newV[i] / (1 - Math.pow(beta2, t)); newP[i] = params[i] - lr * mHat / (Math.sqrt(vHat) + eps); } return { params: newP, m: newM, v: newV }; }
Use your Adam implementation to minimize f(x) = x² starting from x=5. Run 100 steps. The gradient of x² is 2x. Return the final value of x (should be near 0).
javascript function minimizeQuadratic(x0, numSteps) { let p = [x0], m = [0], v = [0]; for (let t = 1; t <= numSteps; t++) { const g = [2 * p[0]]; // gradient of x^2 const b1 = 0.9, b2 = 0.999, eps = 1e-8, lr = 0.1; m = [b1 * m[0] + (1 - b1) * g[0]]; v = [b2 * v[0] + (1 - b2) * g[0] * g[0]]; const mH = m[0] / (1 - Math.pow(b1, t)); const vH = v[0] / (1 - Math.pow(b2, t)); p = [p[0] - lr * mH / (Math.sqrt(vH) + eps)]; } return p[0]; }
Implement vanilla SGD step for comparison. Much simpler: θ = θ − lr · grad.
javascript function sgdStep(params, grads, lr) { return params.map((p, i) => p - lr * grads[i]); }
This Adam implementation is producing NaN after a few steps. Click the buggy line.
function adamStep(p, g, m, v, t, lr, b1, b2, eps) { const newM = b1 * m + (1 - b1) * g; const newV = b2 * v + (1 - b2) * g * g; const mHat = newM / (1 - Math.pow(b1, t)); const vHat = newV / (1 - Math.pow(b2, t)); const newP = p - lr * mHat / Math.sqrt(vHat); return { p: newP, m: newM, v: newV }; }
Line 6 is the bug. The epsilon is missing from the denominator. It should be Math.sqrt(vHat) + eps. Without epsilon, when vHat is very small (near zero), you divide by a near-zero number, producing huge values that eventually become NaN. The epsilon (typically 1e-8) prevents division by zero.
During autoregressive generation, we produce one token at a time. Naively, each new token requires recomputing attention over the entire sequence. But the K and V projections of previous tokens don't change — only the new token's Q, K, V are new. A KV cache stores previously computed K and V, so each generation step only computes attention for the new query against all cached keys and values.
Implement a KV cache with append(newK, newV) and get() methods. newK and newV are 1D arrays (one token's key/value vectors). get() returns {keys, values} as 2D arrays.
javascript function createKVCache() { const keys = [], values = []; return { append(newK, newV) { keys.push([...newK]); values.push([...newV]); }, get() { return { keys: keys.map(k => [...k]), values: values.map(v => [...v]) }; }, get length() { return keys.length; }, }; }
Implement single-query attention against a KV cache. q is [d_k] (one query vector), keys is [n, d_k], values is [n, d_k]. Return the attention output [d_k].
javascript function cachedAttention(q, keys, values) { const dk = q.length, n = keys.length; const scale = 1 / Math.sqrt(dk); // scores const scores = keys.map(k => { let s = 0; for (let i = 0; i < dk; i++) s += q[i] * k[i]; return s * scale; }); // softmax const max = Math.max(...scores); const exps = scores.map(s => Math.exp(s - max)); const sum = exps.reduce((a, b) => a + b); const weights = exps.map(e => e / sum); // weighted sum of values return Array.from({length: dk}, (_, i) => { let s = 0; for (let j = 0; j < n; j++) s += weights[j] * values[j][i]; return s; }); }
LLaMA 7B: d=4096, 32 layers, 32 heads (d_k = 128), FP16. For a sequence of 2048 tokens, how much memory does the KV cache require?
Per layer: 2 (K and V) × seq_len × d × 2 bytes. Total = 2 × seq × d × 2 × num_layers.
For a 7B model, the KV cache for a single 2048-token sequence takes 1 GB of GPU memory. This is why long-context inference is memory-bound. At 128K context, this would be 64 GB — more than the model weights.
Implement a simple autoregressive generation loop using a KV cache. Given a "model" function that returns (logit, newK, newV) for each token, generate numTokens by always picking the highest-logit token (greedy).
javascript function generate(modelFn, startToken, numTokens) { const keys = [], values = [], output = []; let token = startToken; for (let i = 0; i < numTokens; i++) { const { logit, newK, newV } = modelFn(token, keys, values); keys.push(newK); values.push(newV); token = logit; output.push(token); } return output; }
Greedy decoding (always pick the highest probability token) produces repetitive, boring text. Sampling adds randomness. But sampling from the full distribution gives incoherent results — rare tokens like "banana" might follow "The president signed the". Top-k and top-p (nucleus) sampling restrict the sampling pool to only plausible tokens.
Implement temperature-scaled sampling. Divide logits by T, apply softmax, then sample from the distribution. Return the sampled index. For deterministic testing, use the provided random number r ∈ [0,1) instead of Math.random().
javascript function temperatureSample(logits, T, r) { const scaled = logits.map(x => x / T); const max = Math.max(...scaled); const exps = scaled.map(x => Math.exp(x - max)); const sum = exps.reduce((a, b) => a + b); const probs = exps.map(e => e / sum); let cum = 0; for (let i = 0; i < probs.length; i++) { cum += probs[i]; if (cum > r) return i; } return probs.length - 1; }
Implement top-k filtering: keep the k highest-probability tokens, set the rest to −Infinity, then apply softmax. Return the filtered probability distribution.
javascript function topKFilter(logits, k) { const sorted = [...logits].sort((a, b) => b - a); const threshold = sorted[k - 1]; const filtered = logits.map(x => x >= threshold ? x : -Infinity); const max = Math.max(...filtered); const exps = filtered.map(x => Math.exp(x - max)); const sum = exps.reduce((a, b) => a + b); return exps.map(e => e / sum); }
Implement top-p sampling: sort tokens by probability, include tokens until cumulative probability exceeds p, zero the rest. Return the renormalized distribution.
javascript function topPFilter(logits, p) { // softmax first const max = Math.max(...logits); const exps = logits.map(x => Math.exp(x - max)); const sum = exps.reduce((a, b) => a + b); const probs = exps.map(e => e / sum); // sort indices by probability descending const indices = probs.map((v, i) => i).sort((a, b) => probs[b] - probs[a]); let cum = 0; const keep = new Set(); for (const idx of indices) { keep.add(idx); cum += probs[idx]; if (cum > p) break; } const filtered = probs.map((v, i) => keep.has(i) ? v : 0); const fSum = filtered.reduce((a, b) => a + b); return filtered.map(v => v / fSum); }
Implement repetition penalty: divide the logits of previously-generated tokens by a penalty factor. This discourages the model from repeating itself.
javascript function applyRepetitionPenalty(logits, generatedTokens, penalty) { const result = [...logits]; const seen = new Set(generatedTokens); for (let i = 0; i < result.length; i++) { if (seen.has(i)) result[i] /= penalty; } return result; }
Combine everything: apply temperature, then top-k filtering, then sample. Return the sampled index.
javascript function sampleWithTopK(logits, temperature, k, r) { const scaled = logits.map(x => x / temperature); // top-k filter const sorted = [...scaled].sort((a, b) => b - a); const threshold = sorted[k - 1]; const filtered = scaled.map(x => x >= threshold ? x : -Infinity); // softmax const max = Math.max(...filtered); const exps = filtered.map(x => Math.exp(x - max)); const sum = exps.reduce((a, b) => a + b); const probs = exps.map(e => e / sum); // sample let cum = 0; for (let i = 0; i < probs.length; i++) { cum += probs[i]; if (cum > r) return i; } return probs.length - 1; }