From Shannon entropy to cross-entropy loss — every formula an ML engineer needs, derived by hand with instant feedback.
You flip a fair coin. Before it lands, how "uncertain" are you about the outcome? Intuitively: maximally uncertain between two options. Shannon formalized this intuition into a single number called entropy — measured in bits.
The entropy of a discrete random variable X with probabilities p1, p2, …, pn is:
Why log2? Because one bit is the amount of information needed to distinguish between two equally likely outcomes. Why the negative sign? Because log2(p) is negative when p < 1, and we want entropy to be positive.
Fair coin: p = (0.5, 0.5). H = −0.5·log2(0.5) − 0.5·log2(0.5) = −0.5·(−1) − 0.5·(−1) = 1 bit.
Biased coin: p = (0.9, 0.1). H = −0.9·log2(0.9) − 0.1·log2(0.1) = −0.9·(−0.152) − 0.1·(−3.322) = 0.137 + 0.332 = 0.469 bits. Less uncertainty because you can almost predict the outcome.
English text has about 1.0–1.5 bits per character. Shannon estimated this in 1951 by having people guess the next letter — the fact that English is so predictable (low entropy) is why compression works so well.
A fair 3-sided die has outcomes {1, 2, 3} each with probability 1/3. Compute H(X) in bits.
H = −3 × (1/3)·log2(1/3). Remember: log2(3) ≈ 1.595.
For a fair n-sided die, H = log2(n). This is always the maximum entropy for n outcomes — a uniform distribution maximizes uncertainty.
A 4-sided die has probabilities P = (0.5, 0.25, 0.125, 0.125). Compute H(X).
Maximum entropy for 4 outcomes is log2(4) = 2 bits. This distribution has 1.75 bits — less than maximum because it's non-uniform.
Entropy is maximized by the uniform distribution. Huniform = log2(8) = 3 bits. The deterministic case (1,0,…,0) has H = 0 — no uncertainty at all. The concentrated case (0.9, …) has H ≈ 0.922 bits. The binary case (0.5, 0.5, 0, …) has H = 1 bit.
The binary entropy function Hb(p) = −p·log2(p) − (1−p)·log2(1−p) describes the entropy of a coin with bias p. Compute Hb(0.2).
Write a function that computes the Shannon entropy in bits given an array of probabilities. Handle the convention that 0·log(0) = 0.
javascript function entropy(probs) { let h = 0; for (const p of probs) { if (p > 0) h -= p * Math.log2(p); } return h; }
This entropy function returns negative values. Click the buggy line.
function entropy(probs) { let h = 0; for (const p of probs) { if (p > 0) h += p * Math.log2(p); } return h; }
Line 4 uses += instead of -=. Since log2(p) is negative for 0 < p < 1, summing p * log2(p) produces a negative number. The definition of entropy has a negative sign: H = −∑ p·log(p). Either negate the sum at the end or accumulate with -=.
So far we measured uncertainty in a single variable. But in ML, we almost always care about relationships between variables — input and output, features and labels, encoder and decoder. We need tools for measuring shared and conditional uncertainty.
Joint entropy H(X,Y) measures the total uncertainty in the pair (X,Y):
Conditional entropy H(Y|X) measures the remaining uncertainty in Y once you know X:
Mutual information I(X;Y) measures how much knowing X reduces uncertainty about Y:
A critical inequality: H(Y|X) ≤ H(Y). Conditioning never increases entropy. In the worst case (X and Y are independent), H(Y|X) = H(Y) and mutual information is zero.
Given this joint distribution over (X, Y) where X ∈ {0,1} and Y ∈ {0,1}:
| Y=0 | Y=1 | |
|---|---|---|
| X=0 | 0.4 | 0.1 |
| X=1 | 0.1 | 0.4 |
Compute H(X,Y). Apply the entropy formula to all four joint probabilities.
Using the same table from 1.1: the marginal P(X) = (0.5, 0.5) so H(X) = 1 bit. Compute H(Y|X) using the chain rule.
Knowing X reduces our uncertainty about Y from H(Y) = 1.0 bit to 0.722 bits. The mutual information I(X;Y) = H(Y) − H(Y|X) = 1.0 − 0.722 = 0.278 bits.
For the same table: compute I(X;Y) using the formula I(X;Y) = H(X) + H(Y) − H(X,Y). Note that H(Y) = 1 bit (marginal is also 0.5, 0.5).
This matches I(X;Y) = H(Y) − H(Y|X) = 1.0 − 0.722 = 0.278. Mutual information is symmetric: knowing X tells you as much about Y as knowing Y tells you about X.
If X and Y are independent, H(X,Y) = H(X) + H(Y). So I(X;Y) = H(X) + H(Y) − H(X,Y) = 0. Equivalently, H(Y|X) = H(Y) — learning X doesn't reduce uncertainty about Y at all. Mutual information is always ≥ 0.
You roll two independent fair 6-sided dice. What is H(X,Y)?
For independent variables: H(X,Y) = H(X) + H(Y).
Equivalently: the pair (X,Y) has 36 equally likely outcomes, so H = log2(36) = log2(6²) = 2·log2(6) = 5.170 bits.
Knowing whether it rains reduces temperature uncertainty by 0.4 bits. This is intuitive — rainy days tend to be cooler, so rain gives you partial information about temperature.
Entropy measures uncertainty in a single distribution. But in ML, we constantly compare distributions: How different is my model's output from the true distribution? How much did fine-tuning change the model? KL divergence quantifies this distance.
Read it as: "the KL divergence from Q to P" or "the extra bits needed to encode samples from P using a code optimized for Q." The order matters — P is the "true" distribution, Q is the "approximation."
Gibbs' inequality: DKL(P||Q) ≥ 0, with equality if and only if P = Q. This is one of the most fundamental inequalities in information theory, and it's why minimizing KL divergence is equivalent to finding the closest approximation.
Compute DKL(P||Q) where P = (0.5, 0.5) and Q = (0.9, 0.1).
DKL = 0.5·log2(0.5/0.9) + 0.5·log2(0.5/0.1).
The fair distribution P "wastes" 0.737 bits when encoded using a code optimized for Q. This is asymmetric — compare with DKL(Q||P) in the next exercise.
Now compute DKL(Q||P) where Q = (0.9, 0.1) and P = (0.5, 0.5). Compare with the previous exercise to see asymmetry.
DKL(P||Q) ≠ DKL(Q||P) in general, though in this specific case with P uniform they happen to relate simply. The asymmetry becomes dramatic when Q has a zero probability where P is nonzero — that makes DKL(P||Q) = ∞.
Compute DKL(P||Q) where P = (0.4, 0.3, 0.3) and Q = (0.35, 0.35, 0.3).
Hint: log2(0.4/0.35) = log2(8/7) ≈ 0.193. log2(0.3/0.35) = log2(6/7) ≈ −0.222. log2(0.3/0.3) = 0.
Very close distributions have very small KL. This is why KL works well as a loss — as the model distribution approaches the target, the loss smoothly approaches zero.
DKL(P||Q) requires computing log(p/q). When p > 0 but q = 0, we get log(p/0) = ∞. This is infinite divergence — a code designed for Q literally cannot encode an event that Q says is impossible but P says can happen. This is why in ML, we use label smoothing or add small ε to predicted probabilities.
Cross-entropy is always at least as large as the true entropy H(P). The gap is exactly DKL — the "cost of being wrong." This is why minimizing cross-entropy is equivalent to minimizing KL divergence (since H(P) is fixed).
Write a function that computes DKL(P||Q) in bits. If qi = 0 where pi > 0, return Infinity.
javascript function klDivergence(p, q) { let kl = 0; for (let i = 0; i < p.length; i++) { if (p[i] === 0) continue; if (q[i] === 0) return Infinity; kl += p[i] * Math.log2(p[i] / q[i]); } return kl; }
Here's where information theory meets ML directly. Every time you train a classifier, you're minimizing cross-entropy loss. It's not an arbitrary choice — it's the information-theoretically optimal objective.
Where P is the true distribution (labels) and Q is the model's predicted distribution. In classification, P is a one-hot vector: pc = 1 for the true class c, and 0 elsewhere. This simplifies beautifully:
-log(predicted_probability_of_true_class) is literally the cross-entropy between the one-hot label and the softmax output. Minimizing cross-entropy = minimizing KL divergence from the true label distribution, because H(P,Q) = H(P) + DKL(P||Q) and H(P) is constant.In practice, ML frameworks use natural log (base e) instead of log2, so the loss is in nats rather than bits. The relationship: 1 nat = 1/ln(2) ≈ 1.443 bits.
A model predicts softmax probabilities [0.7, 0.2, 0.1] for a 3-class problem. The true class is 0 (first class). Compute the cross-entropy loss in nats.
For one-hot labels: CE = −ln(qtrue_class). Use ln(0.7) ≈ −0.357.
In bits: 0.357 × 1.443 = 0.515 bits. The loss only depends on the predicted probability of the true class. If the model predicted 0.99 for the true class, the loss would be −ln(0.99) = 0.01 nats — nearly perfect.
The model predicts [0.01, 0.01, 0.98] but the true class is 0. Compute the cross-entropy loss in nats.
The model was very confident in the wrong class. The loss is huge. If the model had predicted exactly 0 for the true class, the loss would be infinity — this is why we never let softmax outputs reach exact zero.
With label smoothing, the target distribution is P = (0.9, 0.05, 0.05) instead of one-hot. Model predicts Q = (0.7, 0.2, 0.1). Compute CE(P,Q) in nats.
Use the full formula: −∑ pi·ln(qi). ln(0.7) ≈ −0.357, ln(0.2) ≈ −1.609, ln(0.1) ≈ −2.303.
With label smoothing, the loss is higher (0.517 vs 0.357) even with the same predictions, because we're penalizing the model for not spreading some probability to the non-true classes. This acts as regularization.
For one-hot labels, cross-entropy and KL divergence are identical. This is why minimizing cross-entropy loss is exactly the same as minimizing KL divergence from the labels. The two objectives are interchangeable.
Write a function that computes cross-entropy in nats. Takes true distribution p and predicted distribution q.
javascript function crossEntropy(p, q) { let ce = 0; for (let i = 0; i < p.length; i++) { if (p[i] > 0) ce -= p[i] * Math.log(q[i]); } return ce; }
A language model achieves a cross-entropy loss of 3.2 nats per token. Convert this to bits per token. (1 bit = ln(2) ≈ 0.693 nats.)
Equivalently: bits = nats × log2(e) = 3.2 × 1.4427 = 4.617. Perplexity = 2bits = 24.617 ≈ 24.5, meaning the model is "as confused as choosing uniformly among 24.5 tokens."
Shannon's source coding theorem (1948) tells us the fundamental limit of lossless compression: you cannot compress data below its entropy rate, on average. Any lossless code must use at least H(X) bits per symbol on average.
Huffman coding achieves this bound to within 1 bit per symbol. The algorithm: assign shorter codes to more probable symbols, longer codes to less probable ones. The construction is bottom-up: repeatedly merge the two lowest-probability symbols.
Five symbols A-E have probabilities P = (0.4, 0.2, 0.2, 0.1, 0.1). Compute the entropy H(X) — this is the theoretical minimum average code length.
Step 1: Merge D(0.1) + E(0.1) = DE(0.2). Queue: A(0.4), B(0.2), C(0.2), DE(0.2).
Step 2: Merge B(0.2) + C(0.2) = BC(0.4). Queue: A(0.4), BC(0.4), DE(0.2).
Wait — we should merge the two smallest: actually after step 1, we pick the two smallest from {A:0.4, B:0.2, C:0.2, DE:0.2}. That's any two of {B, C, DE}. Merge C(0.2) + DE(0.2) = CDE(0.4). Queue: A(0.4), B(0.2), CDE(0.4).
Step 3: Merge B(0.2) + A(0.4) or B(0.2) + CDE(0.4). Merge B(0.2) + CDE(0.4) = BCDE(0.6). Queue: A(0.4), BCDE(0.6).
Step 4: Merge A(0.4) + BCDE(0.6) = root.
Result: A=1 bit (code: 0), B=3 bits (code: 100), C=3 bits (code: 110), D=3 bits (code: 1110), E=3 bits (code: 1111). Actually this depends on tie-breaking. One valid Huffman: A:1, B:3, C:3, D:3, E:3.
Using the Huffman codes A:1 bit, B:3 bits, C:3 bits, D:3 bits, E:3 bits with probabilities (0.4, 0.2, 0.2, 0.1, 0.1), compute the average code length L.
The entropy was 2.122 bits. The Huffman code achieves 2.2 bits — only 0.078 bits above the theoretical minimum. The efficiency is H/L = 2.122/2.2 = 96.5%. A better Huffman tree (codes A:1, B:2, C:3, D:4, E:4 or similar) could get closer.
An alternative valid Huffman tree gives codes A:1 bit, B:2 bits, C:3 bits, D:4 bits, E:4 bits. Compute the average code length with probabilities (0.4, 0.2, 0.2, 0.1, 0.1).
Same average! Different valid Huffman trees can produce different code assignments, but the average code length is always the same. This Huffman average of 2.2 is optimal — no prefix-free code can do better for these probabilities.
A file contains 100,000 characters from alphabet {A,B,C,D,E} with frequencies matching probabilities (0.4, 0.2, 0.2, 0.1, 0.1). Fixed-length encoding uses 3 bits/char. What's the compression ratio using Huffman (2.2 bits/char)?
Compression ratio = uncompressed size / compressed size.
We saved 80,000 bits (10 KB). The theoretical limit would be 100,000 × 2.122 = 212,200 bits, giving ratio 1.414:1. Huffman gets us 96.5% of the way there.
Put these steps of Huffman compression in order.
1. Count frequencies → 2. Build tree (merge lowest-probability nodes) → 3. Assign codes (left=0, right=1 from root) → 4. Encode (replace each symbol with its code) → 5. Header (prepend the codebook so the decoder can reconstruct the tree).
Source coding tells you how much you can compress. Channel coding tells you how fast you can reliably communicate over a noisy channel. Shannon's channel coding theorem (1948): for any rate below the channel capacity C, there exist codes that achieve arbitrarily low error probability.
The binary symmetric channel (BSC) flips each bit independently with probability p:
When p = 0 (noiseless), C = 1 bit per channel use. When p = 0.5 (pure noise), C = 0 — the channel is useless.
where W is bandwidth in Hz and SNR is the signal-to-noise ratio (linear, not dB).
Compute the capacity of a BSC with flip probability p = 0.1. First compute Hb(0.1), then subtract from 1.
Even with 10% bit flips, we can still reliably communicate at 0.531 bits per channel use. The trick is using error-correcting codes that add redundancy — you send more bits than needed, and the decoder corrects the errors.
An AWGN channel has bandwidth W = 1 MHz and SNR = 10 (linear). Compute the capacity in Mbits/second.
Doubling the bandwidth doubles the capacity (linear). But doubling the SNR only adds log2(2) ≈ 1 bit/s/Hz in the high-SNR regime. Bandwidth is precious; power is cheap. This drives engineering decisions in telecommunications.
A Wi-Fi channel has SNR = 20 dB and bandwidth 20 MHz. Convert SNR to linear (SNRlinear = 10SNRdB/10) and compute capacity.
Real Wi-Fi 5 (802.11ac) on a 20 MHz channel achieves about 87 Mbits/s — roughly 65% of Shannon capacity. The gap is due to practical coding, guard intervals, and protocol overhead.
At p = 0.5, the output is a random coin flip regardless of input — zero information passes through. But at p = 1.0, every bit is flipped deterministically. The receiver just inverts every bit and recovers the original perfectly. C = 1 − Hb(1.0) = 1 − 0 = 1 bit. A channel that always flips is just as good as one that never flips!
You need 10 Mbits/s capacity. Option A: W = 5 MHz, find minimum SNR. Option B: W = 2 MHz, find minimum SNR. Compute both.
From C = W·log2(1+SNR): SNR = 2C/W − 1.
Halving the bandwidth requires ~10x more power (SNR). This exponential tradeoff is why bandwidth is the more valuable resource — spectrum is scarce, transmit power is adjustable.
Source coding (Ch 4) was about lossless compression — no information lost. But what if you're willing to tolerate some distortion? JPEG, MP3, video codecs, and neural network quantization all exploit this. Rate-distortion theory tells us the minimum bit rate R needed to represent a source with average distortion at most D.
For a Gaussian source X ~ N(0, σ²) with mean squared error distortion:
A Gaussian source has variance σ² = 1. Compute R(D) for D = 0.1 (low distortion).
To represent a Gaussian source with 10% of its variance as distortion, you need at least 1.661 bits per sample. This is the absolute floor — no quantizer can do better.
Same source (σ² = 1). How many additional bits per sample do you need to halve the distortion from D = 0.1 to D = 0.05?
Halving the distortion costs exactly 0.5 bits per sample for a Gaussian source. This is a universal rule for Gaussian R-D: each halving of distortion adds 0.5 bits. Equivalently, each additional bit reduces distortion by factor of 4 (6 dB per bit).
You're quantizing neural network weights from FP32 to a fixed number of bits. If weights are approximately Gaussian with σ² = 0.01, and you can tolerate MSE distortion D = 0.0001 (1% of variance), how many bits per weight minimum?
This says ~3.3 bits per weight is the theoretical minimum for 1% distortion. In practice, 4-bit quantization (INT4) works well for many models — it's slightly above the theoretical limit. Going to 2-bit quantization would give D = σ²/24 = 0.000625, which is 6.25% of variance — often too much.
R(D) = 0.5·log2(σ²/D). Setting R = 0: 0 = 0.5·log2(σ²/D), so σ²/D = 1, meaning D = σ². With zero bits, the optimal strategy is to always output the mean (0), giving MSE = E[X²] = σ². You can't encode anything, so you just predict the most likely value.
Starting from R(D) = 0.5·log2(σ²/D), show that each additional bit of rate reduces distortion by a factor of 4. If D1 at R bits, what is D2 at R+1 bits? Express D2/D1 as a fraction.
Each additional bit buys a 4x reduction in distortion. In dB: 10·log10(4) = 6.02 dB. This "6 dB per bit" rule is a cornerstone of audio/video engineering and quantization theory.
In 2017, Shwartz-Ziv and Tishby proposed the information bottleneck view of deep learning: a neural network's hidden layers first compress the input (reduce I(X;T)) and then extract relevant information about the output (increase I(T;Y)), where T is the representation at a hidden layer.
The data processing inequality guarantees this: processing can never create information. If X → T → Y is a Markov chain, then I(X;Y) ≥ I(T;Y). A network can only preserve or lose information about the label, never create it from nothing.
The DPI says: for X → T → Y, I(X;Y) ≥ I(T;Y). Each processing step can only lose information. So I(X;Y) ≥ I(T1;Y) ≥ I(T2;Y). A network must preserve the right information — it can't synthesize new predictive information about Y.
A binary classifier takes input X ∈ {0,1} (uniform) through a noisy channel with P(Y=1|X=1) = 0.9 and P(Y=1|X=0) = 0.2. Compute I(X;Y).
Compute: H(Y) from marginal P(Y), then H(Y|X) from the conditional distributions, then I = H(Y) − H(Y|X).
Note Hb(0.9) = Hb(0.1) = 0.469 since Hb(p) = Hb(1−p). Knowing X reduces uncertainty about Y by 0.397 bits — a substantial fraction of H(Y). The channel is informative but noisy.
A sufficient statistic T captures everything about Y that X contained, while potentially discarding irrelevant details of X. This is the ideal: I(X;T) can be much smaller than H(X) (compression), but I(T;Y) = I(X;Y) (no label information lost). The information bottleneck objective seeks exactly this: minimize I(X;T) subject to preserving I(T;Y).
The IB objective minimizes L = I(X;T) − β·I(T;Y). If I(X;Y) = 0.8 bits and the optimal representation achieves I(T;Y) = 0.8 bits (sufficient) with I(X;T) = 2.0 bits, compute L for β = 1.0 and β = 5.0.
Higher β means we value label information more relative to compression. At β = 1, the cost of storing 2 bits dominates. At β = 5, the label information is so valuable that the Lagrangian is very negative (good). The IB method sweeps β to trace the optimal tradeoff curve.
MI = H(T) − H(T|Y) requires knowing the distribution of T. For a 768-dimensional hidden representation, estimating this density is essentially impossible with finite samples. This is the "curse of dimensionality" for MI estimation. Modern approaches use variational bounds (MINE, InfoNCE) or binning with caveats. This difficulty is why the information plane theory remains controversial.
The variational autoencoder (VAE) has a deep connection to information theory. Its training objective, the ELBO (Evidence Lower Bound), is literally a coding cost:
The first term is the reconstruction quality (how well the decoder recovers x from z). The second term is the information cost — how many extra bits the encoder uses beyond the prior. The ELBO is a lower bound on log p(x), the log-likelihood of the data.
For Gaussian encoder q(z|x) = N(μ, σ²) and standard normal prior p(z) = N(0,1), the KL has a closed form:
The VAE encoder outputs μ = 0.5, σ = 0.8 for one latent dimension. The prior is N(0,1). Compute the KL divergence in nats.
For q = N(μ, σ²) vs p = N(0,1): KL = −0.5·(1 + log(σ²) − μ² − σ²). Use ln(0.64) ≈ −0.446.
The encoder is moderately close to the prior — μ is only 0.5 away from 0, and σ = 0.8 is fairly close to 1. The KL cost per dimension is small. In a VAE with hundreds of latent dimensions, these small per-dimension costs add up.
Compute DKL(N(1, 0.5²) || N(0, 1²)). Use the full formula: KL = ln(σ2/σ1) + (σ1² + (μ1−μ2)²)/(2σ2²) − 0.5.
ELBO = E[log p(x|z)] − KL. The reconstruction term is a negative log-likelihood, typically reported as a positive loss. So ELBO = −50.0 − 3.2 = −53.2. Maximizing the ELBO (making it less negative) means: (1) improve reconstruction (decrease the 50.0), and (2) keep the posterior close to the prior (decrease the 3.2). In practice, we minimize −ELBO = 53.2.
Posterior collapse: q(z|x) = p(z) regardless of x, so KL = 0 — the "information cost" is zero because the encoder transmits zero information. But this means the decoder gets no useful signal from z and must model p(x) entirely on its own, giving poor reconstruction. The fix: KL annealing — start with low KL weight and gradually increase it, letting the encoder learn useful representations before being penalized for diverging from the prior.
A VAE has latent dimension d = 3. The encoder outputs μ = (0.5, −0.3, 0.8) and σ = (0.7, 1.2, 0.4). The prior is N(0, I). Compute the total KL (sum over dimensions).
For each dimension j: KLj = −0.5(1 + ln(σj²) − μj² − σj²). Sum them.
Total = 0.227 + 0.083 + 0.816 = 1.126 nats. Dimension 2 dominates: it has both a shifted mean (μ=0.8) and a narrow variance (σ=0.4), making it most different from the N(0,1) prior. Dimension 1 contributes least because its wider variance (σ=1.2) partially offsets the small mean shift.
This VAE KL computation produces negative values. Click the buggy line.
function vaeKL(mu, logvar) { // mu, logvar: arrays of length d let kl = 0; for (let i = 0; i < mu.length; i++) { const sigma2 = Math.exp(logvar[i]); kl += 0.5 * (1 + logvar[i] - mu[i]*mu[i] - sigma2); } return kl; }
Line 6 computes the negative ELBO contribution (which is negative, not the KL itself). The KL divergence is KL = −0.5·(1 + log(σ²) − μ² − σ²). The line is missing the negative sign. It should be: kl += -0.5 * (1 + logvar[i] - mu[i]*mu[i] - sigma2);. Without the negation, the sum is negative (since σ² + μ² > 1 + log(σ²) for typical values), making KL appear negative, which violates KL ≥ 0.
You've built the full toolkit. Now let's analyze a real-world compression scenario end-to-end: from character frequencies to Huffman bounds to LLM-as-compressor. This is the kind of back-of-envelope analysis that separates someone who understands information theory from someone who just memorized formulas.
Convert the frequency counts to probabilities (divide by 10,000) and compute H(X). The probabilities are: space=0.2043, E=0.127, T=0.0906, A=0.0817, O=0.0751, I=0.0697, N=0.0675, S=0.0633, H=0.0609, R=0.0599, other=0.1.
Use the formula H = −∑ pi·log2(pi). This has 11 terms.
This is the zero-order (unigram) entropy — treating each character as independent. Real English text has much lower entropy (~1.0–1.5 bits/char) because of sequential dependencies. The gap between unigram entropy and true entropy is what n-gram and neural language models exploit.
Given the unigram entropy from 9.1 (~3.3 bits/char), what is the theoretical minimum file size for the 10,000-character file? Express in bytes (8 bits = 1 byte). Compare to the naive ASCII size (8 bits/char).
We can compress to ~41% of the original size using just unigram statistics. Real compressors (gzip, zstd) do even better because they exploit sequential patterns. The Shannon limit with full context is ~1.2 bits/char × 10,000 = 1,500 bytes — a 6.67:1 ratio.
Huffman: H ≤ L < H + 1 (per symbol). Arithmetic coding can approach H to within O(1/n) total bits for a sequence of length n. For long sequences, arithmetic coding is near-optimal. This is why modern compressors (zlib, zstd, ANS) use arithmetic-coding variants rather than Huffman.
A language model achieves perplexity 15.3 on a test set. Convert this to bits per character (assuming character-level tokenization). Perplexity = 2H where H is the cross-entropy in bits.
This model is worse than unigram Huffman (3.3 bits/char)! It must be a bad model or the test set is unusual. GPT-4-class models achieve ~0.7–0.9 bits per character on standard English, approaching the Shannon limit. A perplexity of 15.3 at the character level is quite high.
Most LLMs use subword tokenization. GPT-2 achieves perplexity 29.4 on WikiText-103 with an average of 1.3 tokens per word and 4.5 characters per word. Convert to bits per character.
bits/token = log2(PPL). tokens/char = 1.3/4.5. bits/char = bits/token × tokens/char.
~1.4 bits per character is in the range of Shannon's estimate for English (1.0–1.5 bits/char). This means GPT-2 captures most of the statistical structure of English text. Better models (GPT-4) push this even lower, suggesting the true entropy of English may be below 1.0 bits/char.
For our 10,000-character file, compute the compressed size (in bytes) for each method: (a) Unigram Huffman at 3.3 bits/char, (b) gzip-like at ~2.0 bits/char, (c) LLM-based at 1.4 bits/char. Express all as bytes.
From 10 KB original to 1.75 KB — a 5.7:1 compression ratio. The progression shows how each level of modeling captures more structure: unigram frequencies → local patterns (gzip) → full linguistic context (LLM). This is why "language modeling is compression" is not just a metaphor — a better language model is literally a better compressor.
Order these compression approaches from worst to best compression ratio on English text.
Worst → Best compression: ASCII (no compression) → Huffman (unigram model) → gzip (local pattern matching) → LLM (full context model) → Shannon limit (theoretical floor). Each step exploits more structure in the data. The Shannon limit is unreachable but approachable.
Write a function that takes a perplexity value and characters-per-token ratio, and returns bits per character.
javascript function bitsPerChar(perplexity, charsPerToken) { const bitsPerToken = Math.log2(perplexity); return bitsPerToken / charsPerToken; }