Workbook — Information Theory

Information Theory

From Shannon entropy to cross-entropy loss — every formula an ML engineer needs, derived by hand with instant feedback.

Prerequisites: Basic probability (distributions, expected value) + Logarithms (log rules). That's it.
10
Chapters
59
Exercises
5
Exercise Types
Mastery
0 / 59 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: Entropy

You flip a fair coin. Before it lands, how "uncertain" are you about the outcome? Intuitively: maximally uncertain between two options. Shannon formalized this intuition into a single number called entropy — measured in bits.

The entropy of a discrete random variable X with probabilities p1, p2, …, pn is:

H(X) = −∑i pi · log2(pi)

Why log2? Because one bit is the amount of information needed to distinguish between two equally likely outcomes. Why the negative sign? Because log2(p) is negative when p < 1, and we want entropy to be positive.

Entropy = expected surprise. The quantity −log2(pi) is called the surprisal or self-information of outcome i. A rare event (small p) has high surprisal. Entropy is just the average surprisal, weighted by probability. High entropy means "on average, outcomes are surprising" — i.e., the distribution is spread out.

Fair coin: p = (0.5, 0.5). H = −0.5·log2(0.5) − 0.5·log2(0.5) = −0.5·(−1) − 0.5·(−1) = 1 bit.

Biased coin: p = (0.9, 0.1). H = −0.9·log2(0.9) − 0.1·log2(0.1) = −0.9·(−0.152) − 0.1·(−3.322) = 0.137 + 0.332 = 0.469 bits. Less uncertainty because you can almost predict the outcome.

English text has about 1.0–1.5 bits per character. Shannon estimated this in 1951 by having people guess the next letter — the fact that English is so predictable (low entropy) is why compression works so well.

Exercise 0.1: Three-Sided Die Derive

A fair 3-sided die has outcomes {1, 2, 3} each with probability 1/3. Compute H(X) in bits.

H = −3 × (1/3)·log2(1/3). Remember: log2(3) ≈ 1.595.

bits
Show derivation
H = −3 × (1/3) × log2(1/3)
= −3 × (1/3) × (−1.595)
= 3 × (1/3) × 1.595 = 1.595 bits

For a fair n-sided die, H = log2(n). This is always the maximum entropy for n outcomes — a uniform distribution maximizes uncertainty.

Exercise 0.2: Weighted Die Derive

A 4-sided die has probabilities P = (0.5, 0.25, 0.125, 0.125). Compute H(X).

bits
Show derivation
H = −[0.5·log2(0.5) + 0.25·log2(0.25) + 0.125·log2(0.125) + 0.125·log2(0.125)]
= −[0.5·(−1) + 0.25·(−2) + 0.125·(−3) + 0.125·(−3)]
= −[−0.5 − 0.5 − 0.375 − 0.375]
= 1.75 bits

Maximum entropy for 4 outcomes is log2(4) = 2 bits. This distribution has 1.75 bits — less than maximum because it's non-uniform.

Exercise 0.3: Maximum Entropy Trace
Which distribution over 8 outcomes has the highest entropy?
Show reasoning

Entropy is maximized by the uniform distribution. Huniform = log2(8) = 3 bits. The deterministic case (1,0,…,0) has H = 0 — no uncertainty at all. The concentrated case (0.9, …) has H ≈ 0.922 bits. The binary case (0.5, 0.5, 0, …) has H = 1 bit.

Exercise 0.4: Binary Entropy Function Derive

The binary entropy function Hb(p) = −p·log2(p) − (1−p)·log2(1−p) describes the entropy of a coin with bias p. Compute Hb(0.2).

bits
Show derivation
Hb(0.2) = −0.2·log2(0.2) − 0.8·log2(0.8)
= −0.2·(−2.322) − 0.8·(−0.322)
= 0.464 + 0.259 = 0.722 bits
Exercise 0.5: Implement entropy() Build

Write a function that computes the Shannon entropy in bits given an array of probabilities. Handle the convention that 0·log(0) = 0.

Use Math.log2(). Skip any p === 0.
Show solution
javascript
function entropy(probs) {
  let h = 0;
  for (const p of probs) {
    if (p > 0) h -= p * Math.log2(p);
  }
  return h;
}
Exercise 0.6: Find the Bug Debug

This entropy function returns negative values. Click the buggy line.

function entropy(probs) {
  let h = 0;
  for (const p of probs) {
    if (p > 0) h += p * Math.log2(p);
  }
  return h;
}
Show explanation

Line 4 uses += instead of -=. Since log2(p) is negative for 0 < p < 1, summing p * log2(p) produces a negative number. The definition of entropy has a negative sign: H = ∑ p·log(p). Either negate the sum at the end or accumulate with -=.

Chapter 1: Joint & Conditional Entropy

So far we measured uncertainty in a single variable. But in ML, we almost always care about relationships between variables — input and output, features and labels, encoder and decoder. We need tools for measuring shared and conditional uncertainty.

Joint entropy H(X,Y) measures the total uncertainty in the pair (X,Y):

H(X,Y) = −∑x,y p(x,y) · log2 p(x,y)

Conditional entropy H(Y|X) measures the remaining uncertainty in Y once you know X:

H(Y|X) = H(X,Y) − H(X)
Chain rule of entropy. H(X,Y) = H(X) + H(Y|X). Total uncertainty = uncertainty in X + leftover uncertainty in Y after seeing X. This is the information-theoretic analog of the probability chain rule p(x,y) = p(x)·p(y|x).

Mutual information I(X;Y) measures how much knowing X reduces uncertainty about Y:

I(X;Y) = H(Y) − H(Y|X) = H(X) − H(X|Y) = H(X) + H(Y) − H(X,Y)

A critical inequality: H(Y|X) ≤ H(Y). Conditioning never increases entropy. In the worst case (X and Y are independent), H(Y|X) = H(Y) and mutual information is zero.

Exercise 1.1: Joint Entropy from Table Derive

Given this joint distribution over (X, Y) where X ∈ {0,1} and Y ∈ {0,1}:

Y=0Y=1
X=00.40.1
X=10.10.4

Compute H(X,Y). Apply the entropy formula to all four joint probabilities.

bits
Show derivation
H(X,Y) = −[2×0.4·log2(0.4) + 2×0.1·log2(0.1)]
= −[2×0.4×(−1.322) + 2×0.1×(−3.322)]
= −[−1.059 − 0.664] = 1.722 bits
Exercise 1.2: Conditional Entropy Derive

Using the same table from 1.1: the marginal P(X) = (0.5, 0.5) so H(X) = 1 bit. Compute H(Y|X) using the chain rule.

bits
Show derivation
H(Y|X) = H(X,Y) − H(X) = 1.722 − 1.0 = 0.722 bits

Knowing X reduces our uncertainty about Y from H(Y) = 1.0 bit to 0.722 bits. The mutual information I(X;Y) = H(Y) − H(Y|X) = 1.0 − 0.722 = 0.278 bits.

Exercise 1.3: Mutual Information Derive

For the same table: compute I(X;Y) using the formula I(X;Y) = H(X) + H(Y) − H(X,Y). Note that H(Y) = 1 bit (marginal is also 0.5, 0.5).

bits
Show derivation
I(X;Y) = H(X) + H(Y) − H(X,Y) = 1 + 1 − 1.722 = 0.278 bits

This matches I(X;Y) = H(Y) − H(Y|X) = 1.0 − 0.722 = 0.278. Mutual information is symmetric: knowing X tells you as much about Y as knowing Y tells you about X.

Exercise 1.4: Independent Variables Trace
If X and Y are independent (p(x,y) = p(x)·p(y) for all x,y), what is I(X;Y)?
Show reasoning

If X and Y are independent, H(X,Y) = H(X) + H(Y). So I(X;Y) = H(X) + H(Y) − H(X,Y) = 0. Equivalently, H(Y|X) = H(Y) — learning X doesn't reduce uncertainty about Y at all. Mutual information is always ≥ 0.

Exercise 1.5: Joint Entropy of Independent Dice Derive

You roll two independent fair 6-sided dice. What is H(X,Y)?

For independent variables: H(X,Y) = H(X) + H(Y).

bits
Show derivation
H(X) = log2(6) = 2.595 bits
H(X,Y) = H(X) + H(Y) = 2 × 2.595 = 5.170 bits

Equivalently: the pair (X,Y) has 36 equally likely outcomes, so H = log2(36) = log2(6²) = 2·log2(6) = 5.170 bits.

Exercise 1.6: Conditioning Reduces Entropy Trace
A weather model predicts rain (R) and temperature (T). You observe that H(T) = 3.2 bits, H(T|R) = 2.8 bits, and H(R) = 0.9 bits. What is the mutual information I(R;T)?
Show reasoning
I(R;T) = H(T) − H(T|R) = 3.2 − 2.8 = 0.4 bits

Knowing whether it rains reduces temperature uncertainty by 0.4 bits. This is intuitive — rainy days tend to be cooler, so rain gives you partial information about temperature.

Chapter 2: KL Divergence

Entropy measures uncertainty in a single distribution. But in ML, we constantly compare distributions: How different is my model's output from the true distribution? How much did fine-tuning change the model? KL divergence quantifies this distance.

DKL(P || Q) = ∑i pi · log2(pi / qi)

Read it as: "the KL divergence from Q to P" or "the extra bits needed to encode samples from P using a code optimized for Q." The order matters — P is the "true" distribution, Q is the "approximation."

KL is NOT symmetric. DKL(P||Q) ≠ DKL(Q||P) in general. This is why KL is not a true "distance." It measures the penalty for using Q when the truth is P, which is a different question from using P when the truth is Q.

Gibbs' inequality: DKL(P||Q) ≥ 0, with equality if and only if P = Q. This is one of the most fundamental inequalities in information theory, and it's why minimizing KL divergence is equivalent to finding the closest approximation.

Exercise 2.1: KL for Fair vs. Biased Derive

Compute DKL(P||Q) where P = (0.5, 0.5) and Q = (0.9, 0.1).

DKL = 0.5·log2(0.5/0.9) + 0.5·log2(0.5/0.1).

bits
Show derivation
DKL(P||Q) = 0.5 · log2(0.5/0.9) + 0.5 · log2(0.5/0.1)
= 0.5 · log2(0.556) + 0.5 · log2(5.0)
= 0.5 · (−0.848) + 0.5 · 2.322
= −0.424 + 1.161 = 0.737 bits

The fair distribution P "wastes" 0.737 bits when encoded using a code optimized for Q. This is asymmetric — compare with DKL(Q||P) in the next exercise.

Exercise 2.2: KL in the Other Direction Derive

Now compute DKL(Q||P) where Q = (0.9, 0.1) and P = (0.5, 0.5). Compare with the previous exercise to see asymmetry.

bits
Show derivation
DKL(Q||P) = 0.9 · log2(0.9/0.5) + 0.1 · log2(0.1/0.5)
= 0.9 · log2(1.8) + 0.1 · log2(0.2)
= 0.9 · 0.848 + 0.1 · (−2.322)
= 0.763 − 0.232 = 0.531 bits

DKL(P||Q) ≠ DKL(Q||P) in general, though in this specific case with P uniform they happen to relate simply. The asymmetry becomes dramatic when Q has a zero probability where P is nonzero — that makes DKL(P||Q) = ∞.

Exercise 2.3: KL Between Close Distributions Derive

Compute DKL(P||Q) where P = (0.4, 0.3, 0.3) and Q = (0.35, 0.35, 0.3).

Hint: log2(0.4/0.35) = log2(8/7) ≈ 0.193. log2(0.3/0.35) = log2(6/7) ≈ −0.222. log2(0.3/0.3) = 0.

bits
Show derivation
DKL = 0.4·log2(8/7) + 0.3·log2(6/7) + 0.3·log2(1)
= 0.4 × 0.193 + 0.3 × (−0.222) + 0
= 0.0772 − 0.0666 = 0.0106 bits

Very close distributions have very small KL. This is why KL works well as a loss — as the model distribution approaches the target, the loss smoothly approaches zero.

Exercise 2.4: When KL Explodes Trace
P = (0.5, 0.5) and Q = (1.0, 0.0). What is DKL(P||Q)?
Show reasoning

DKL(P||Q) requires computing log(p/q). When p > 0 but q = 0, we get log(p/0) = ∞. This is infinite divergence — a code designed for Q literally cannot encode an event that Q says is impossible but P says can happen. This is why in ML, we use label smoothing or add small ε to predicted probabilities.

Exercise 2.5: KL Decomposition Trace
The fundamental decomposition is: H(P,Q) = H(P) + DKL(P||Q), where H(P,Q) is cross-entropy. If H(P) = 1.5 bits and DKL(P||Q) = 0.3 bits, what is the cross-entropy?
Show reasoning
CE(P,Q) = H(P) + DKL(P||Q) = 1.5 + 0.3 = 1.8 bits

Cross-entropy is always at least as large as the true entropy H(P). The gap is exactly DKL — the "cost of being wrong." This is why minimizing cross-entropy is equivalent to minimizing KL divergence (since H(P) is fixed).

Exercise 2.6: Implement klDivergence() Build

Write a function that computes DKL(P||Q) in bits. If qi = 0 where pi > 0, return Infinity.

Use Math.log2(). Check for q[i] === 0 when p[i] > 0.
Show solution
javascript
function klDivergence(p, q) {
  let kl = 0;
  for (let i = 0; i < p.length; i++) {
    if (p[i] === 0) continue;
    if (q[i] === 0) return Infinity;
    kl += p[i] * Math.log2(p[i] / q[i]);
  }
  return kl;
}

Chapter 3: Cross-Entropy

Here's where information theory meets ML directly. Every time you train a classifier, you're minimizing cross-entropy loss. It's not an arbitrary choice — it's the information-theoretically optimal objective.

H(P, Q) = −∑i pi · log2(qi)

Where P is the true distribution (labels) and Q is the model's predicted distribution. In classification, P is a one-hot vector: pc = 1 for the true class c, and 0 elsewhere. This simplifies beautifully:

H(P, Q) = −log2(qc) = −log(qc) / log(2)
Cross-entropy IS the ML loss. The standard classification loss -log(predicted_probability_of_true_class) is literally the cross-entropy between the one-hot label and the softmax output. Minimizing cross-entropy = minimizing KL divergence from the true label distribution, because H(P,Q) = H(P) + DKL(P||Q) and H(P) is constant.

In practice, ML frameworks use natural log (base e) instead of log2, so the loss is in nats rather than bits. The relationship: 1 nat = 1/ln(2) ≈ 1.443 bits.

Exercise 3.1: Classification Loss Derive

A model predicts softmax probabilities [0.7, 0.2, 0.1] for a 3-class problem. The true class is 0 (first class). Compute the cross-entropy loss in nats.

For one-hot labels: CE = −ln(qtrue_class). Use ln(0.7) ≈ −0.357.

nats
Show derivation
CE = −ln(0.7) = 0.357 nats

In bits: 0.357 × 1.443 = 0.515 bits. The loss only depends on the predicted probability of the true class. If the model predicted 0.99 for the true class, the loss would be −ln(0.99) = 0.01 nats — nearly perfect.

Exercise 3.2: Worst-Case Loss Derive

The model predicts [0.01, 0.01, 0.98] but the true class is 0. Compute the cross-entropy loss in nats.

nats
Show derivation
CE = −ln(0.01) = ln(100) = 4.605 nats

The model was very confident in the wrong class. The loss is huge. If the model had predicted exactly 0 for the true class, the loss would be infinity — this is why we never let softmax outputs reach exact zero.

Exercise 3.3: Full Cross-Entropy (Soft Labels) Derive

With label smoothing, the target distribution is P = (0.9, 0.05, 0.05) instead of one-hot. Model predicts Q = (0.7, 0.2, 0.1). Compute CE(P,Q) in nats.

Use the full formula: −∑ pi·ln(qi). ln(0.7) ≈ −0.357, ln(0.2) ≈ −1.609, ln(0.1) ≈ −2.303.

nats
Show derivation
CE = −[0.9·ln(0.7) + 0.05·ln(0.2) + 0.05·ln(0.1)]
= −[0.9×(−0.357) + 0.05×(−1.609) + 0.05×(−2.303)]
= −[−0.321 − 0.080 − 0.115]
= 0.517 nats

With label smoothing, the loss is higher (0.517 vs 0.357) even with the same predictions, because we're penalizing the model for not spreading some probability to the non-true classes. This acts as regularization.

Exercise 3.4: CE vs KL Relationship Trace
A one-hot distribution P = (1, 0, 0, 0) has entropy H(P) = 0. If CE(P, Q) = 2.3 nats, what is DKL(P||Q)?
Show reasoning
CE(P,Q) = H(P) + DKL(P||Q)
2.3 = 0 + DKL(P||Q)
DKL(P||Q) = 2.3 nats

For one-hot labels, cross-entropy and KL divergence are identical. This is why minimizing cross-entropy loss is exactly the same as minimizing KL divergence from the labels. The two objectives are interchangeable.

Exercise 3.5: Implement crossEntropy() Build

Write a function that computes cross-entropy in nats. Takes true distribution p and predicted distribution q.

CE = -sum(p[i] * ln(q[i])). Skip terms where p[i] = 0.
Show solution
javascript
function crossEntropy(p, q) {
  let ce = 0;
  for (let i = 0; i < p.length; i++) {
    if (p[i] > 0) ce -= p[i] * Math.log(q[i]);
  }
  return ce;
}
Exercise 3.6: Nats to Bits Derive

A language model achieves a cross-entropy loss of 3.2 nats per token. Convert this to bits per token. (1 bit = ln(2) ≈ 0.693 nats.)

bits/token
Show derivation
bits = nats / ln(2) = 3.2 / 0.693 = 4.617 bits/token

Equivalently: bits = nats × log2(e) = 3.2 × 1.4427 = 4.617. Perplexity = 2bits = 24.617 ≈ 24.5, meaning the model is "as confused as choosing uniformly among 24.5 tokens."

Chapter 4: Source Coding

Shannon's source coding theorem (1948) tells us the fundamental limit of lossless compression: you cannot compress data below its entropy rate, on average. Any lossless code must use at least H(X) bits per symbol on average.

Average code length L ≥ H(X)

Huffman coding achieves this bound to within 1 bit per symbol. The algorithm: assign shorter codes to more probable symbols, longer codes to less probable ones. The construction is bottom-up: repeatedly merge the two lowest-probability symbols.

Why Huffman beats fixed-length. With 5 symbols, a fixed-length code needs ⌈log2(5)⌉ = 3 bits per symbol. But if one symbol appears 40% of the time, Huffman can give it a 1-bit code and push rare symbols to 3-4 bits. The average drops below 3.
Exercise 4.1: Entropy Lower Bound Derive

Five symbols A-E have probabilities P = (0.4, 0.2, 0.2, 0.1, 0.1). Compute the entropy H(X) — this is the theoretical minimum average code length.

bits/symbol
Show derivation
H = −[0.4·log2(0.4) + 2×0.2·log2(0.2) + 2×0.1·log2(0.1)]
= −[0.4×(−1.322) + 2×0.2×(−2.322) + 2×0.1×(−3.322)]
= 0.529 + 0.929 + 0.664 = 2.122 bits/symbol
Exercise 4.2: Build the Huffman Tree Trace
For P = (A:0.4, B:0.2, C:0.2, D:0.1, E:0.1), the Huffman algorithm first merges the two smallest. What are the correct Huffman code lengths for each symbol?
Show Huffman construction

Step 1: Merge D(0.1) + E(0.1) = DE(0.2). Queue: A(0.4), B(0.2), C(0.2), DE(0.2).

Step 2: Merge B(0.2) + C(0.2) = BC(0.4). Queue: A(0.4), BC(0.4), DE(0.2).

Wait — we should merge the two smallest: actually after step 1, we pick the two smallest from {A:0.4, B:0.2, C:0.2, DE:0.2}. That's any two of {B, C, DE}. Merge C(0.2) + DE(0.2) = CDE(0.4). Queue: A(0.4), B(0.2), CDE(0.4).

Step 3: Merge B(0.2) + A(0.4) or B(0.2) + CDE(0.4). Merge B(0.2) + CDE(0.4) = BCDE(0.6). Queue: A(0.4), BCDE(0.6).

Step 4: Merge A(0.4) + BCDE(0.6) = root.

Result: A=1 bit (code: 0), B=3 bits (code: 100), C=3 bits (code: 110), D=3 bits (code: 1110), E=3 bits (code: 1111). Actually this depends on tie-breaking. One valid Huffman: A:1, B:3, C:3, D:3, E:3.

Exercise 4.3: Average Code Length Derive

Using the Huffman codes A:1 bit, B:3 bits, C:3 bits, D:3 bits, E:3 bits with probabilities (0.4, 0.2, 0.2, 0.1, 0.1), compute the average code length L.

bits/symbol
Show derivation
L = 0.4×1 + 0.2×3 + 0.2×3 + 0.1×3 + 0.1×3
= 0.4 + 0.6 + 0.6 + 0.3 + 0.3 = 2.2 bits/symbol

The entropy was 2.122 bits. The Huffman code achieves 2.2 bits — only 0.078 bits above the theoretical minimum. The efficiency is H/L = 2.122/2.2 = 96.5%. A better Huffman tree (codes A:1, B:2, C:3, D:4, E:4 or similar) could get closer.

Exercise 4.4: Optimal Huffman Derive

An alternative valid Huffman tree gives codes A:1 bit, B:2 bits, C:3 bits, D:4 bits, E:4 bits. Compute the average code length with probabilities (0.4, 0.2, 0.2, 0.1, 0.1).

bits/symbol
Show derivation
L = 0.4×1 + 0.2×2 + 0.2×3 + 0.1×4 + 0.1×4
= 0.4 + 0.4 + 0.6 + 0.4 + 0.4 = 2.2 bits/symbol

Same average! Different valid Huffman trees can produce different code assignments, but the average code length is always the same. This Huffman average of 2.2 is optimal — no prefix-free code can do better for these probabilities.

Exercise 4.5: Compression Ratio Derive

A file contains 100,000 characters from alphabet {A,B,C,D,E} with frequencies matching probabilities (0.4, 0.2, 0.2, 0.1, 0.1). Fixed-length encoding uses 3 bits/char. What's the compression ratio using Huffman (2.2 bits/char)?

Compression ratio = uncompressed size / compressed size.

:1 ratio
Show derivation
Uncompressed = 100,000 × 3 = 300,000 bits
Compressed = 100,000 × 2.2 = 220,000 bits
Ratio = 300,000 / 220,000 = 1.364:1

We saved 80,000 bits (10 KB). The theoretical limit would be 100,000 × 2.122 = 212,200 bits, giving ratio 1.414:1. Huffman gets us 96.5% of the way there.

Exercise 4.6: Arrange the Compression Pipeline Design

Put these steps of Huffman compression in order.

?
?
?
?
?
Count symbol frequencies Build Huffman tree Assign codes from tree Replace symbols with codes Prepend codebook header
Show correct order

1. Count frequencies2. Build tree (merge lowest-probability nodes) → 3. Assign codes (left=0, right=1 from root) → 4. Encode (replace each symbol with its code) → 5. Header (prepend the codebook so the decoder can reconstruct the tree).

Chapter 5: Channel Capacity

Source coding tells you how much you can compress. Channel coding tells you how fast you can reliably communicate over a noisy channel. Shannon's channel coding theorem (1948): for any rate below the channel capacity C, there exist codes that achieve arbitrarily low error probability.

The binary symmetric channel (BSC) flips each bit independently with probability p:

CBSC = 1 − Hb(p) = 1 + p·log2(p) + (1−p)·log2(1−p)

When p = 0 (noiseless), C = 1 bit per channel use. When p = 0.5 (pure noise), C = 0 — the channel is useless.

The AWGN channel (additive white Gaussian noise) models continuous signals with Gaussian noise. Its capacity is the famous Shannon-Hartley theorem:
C = W · log2(1 + SNR)   bits/second

where W is bandwidth in Hz and SNR is the signal-to-noise ratio (linear, not dB).

Exercise 5.1: BSC Capacity (p=0.1) Derive

Compute the capacity of a BSC with flip probability p = 0.1. First compute Hb(0.1), then subtract from 1.

bits/use
Show derivation
Hb(0.1) = −0.1·log2(0.1) − 0.9·log2(0.9)
= −0.1×(−3.322) − 0.9×(−0.152)
= 0.332 + 0.137 = 0.469 bits
C = 1 − 0.469 = 0.531 bits/channel use

Even with 10% bit flips, we can still reliably communicate at 0.531 bits per channel use. The trick is using error-correcting codes that add redundancy — you send more bits than needed, and the decoder corrects the errors.

Exercise 5.2: AWGN Channel Derive

An AWGN channel has bandwidth W = 1 MHz and SNR = 10 (linear). Compute the capacity in Mbits/second.

Mbits/s
Show derivation
C = 1,000,000 × log2(1 + 10)
= 1,000,000 × log2(11)
= 1,000,000 × 3.459 = 3.459 Mbits/s

Doubling the bandwidth doubles the capacity (linear). But doubling the SNR only adds log2(2) ≈ 1 bit/s/Hz in the high-SNR regime. Bandwidth is precious; power is cheap. This drives engineering decisions in telecommunications.

Exercise 5.3: SNR in dB Derive

A Wi-Fi channel has SNR = 20 dB and bandwidth 20 MHz. Convert SNR to linear (SNRlinear = 10SNRdB/10) and compute capacity.

Mbits/s
Show derivation
SNRlinear = 1020/10 = 102 = 100
C = 20 × 106 × log2(101)
= 20 × 106 × 6.659
= 133.15 Mbits/s

Real Wi-Fi 5 (802.11ac) on a 20 MHz channel achieves about 87 Mbits/s — roughly 65% of Shannon capacity. The gap is due to practical coding, guard intervals, and protocol overhead.

Exercise 5.4: Useless Channel Trace
A BSC with p = 0.5 has capacity C = 0. Why is a coin-flip channel useless, and what about p = 1.0?
Show reasoning

At p = 0.5, the output is a random coin flip regardless of input — zero information passes through. But at p = 1.0, every bit is flipped deterministically. The receiver just inverts every bit and recovers the original perfectly. C = 1 − Hb(1.0) = 1 − 0 = 1 bit. A channel that always flips is just as good as one that never flips!

Exercise 5.5: Bandwidth vs Power Tradeoff Derive

You need 10 Mbits/s capacity. Option A: W = 5 MHz, find minimum SNR. Option B: W = 2 MHz, find minimum SNR. Compute both.

From C = W·log2(1+SNR): SNR = 2C/W − 1.

SNR (linear) for Option A
Show derivation
Option A: SNR = 210/5 − 1 = 22 − 1 = 3
Option B: SNR = 210/2 − 1 = 25 − 1 = 31

Halving the bandwidth requires ~10x more power (SNR). This exponential tradeoff is why bandwidth is the more valuable resource — spectrum is scarce, transmit power is adjustable.

Chapter 6: Rate-Distortion

Source coding (Ch 4) was about lossless compression — no information lost. But what if you're willing to tolerate some distortion? JPEG, MP3, video codecs, and neural network quantization all exploit this. Rate-distortion theory tells us the minimum bit rate R needed to represent a source with average distortion at most D.

R(D) = minp(x̂|x) I(X; X̂)   subject to E[d(X, X̂)] ≤ D

For a Gaussian source X ~ N(0, σ²) with mean squared error distortion:

R(D) = 0.5 · log2(σ² / D)   for D ≤ σ²
The lossy compression tradeoff. R(D) is a decreasing convex function. Allowing a little distortion saves a lot of bits. Allowing a lot of distortion saves diminishing additional bits. This curve is the fundamental limit every lossy codec operates against.
Exercise 6.1: Gaussian Rate-Distortion Derive

A Gaussian source has variance σ² = 1. Compute R(D) for D = 0.1 (low distortion).

bits/sample
Show derivation
R(0.1) = 0.5 · log2(1 / 0.1) = 0.5 · log2(10)
= 0.5 × 3.322 = 1.661 bits/sample

To represent a Gaussian source with 10% of its variance as distortion, you need at least 1.661 bits per sample. This is the absolute floor — no quantizer can do better.

Exercise 6.2: Halving the Distortion Derive

Same source (σ² = 1). How many additional bits per sample do you need to halve the distortion from D = 0.1 to D = 0.05?

additional bits
Show derivation
R(0.05) = 0.5 · log2(1/0.05) = 0.5 · log2(20) = 0.5 × 4.322 = 2.161
Additional = R(0.05) − R(0.1) = 2.161 − 1.661 = 0.500 bits

Halving the distortion costs exactly 0.5 bits per sample for a Gaussian source. This is a universal rule for Gaussian R-D: each halving of distortion adds 0.5 bits. Equivalently, each additional bit reduces distortion by factor of 4 (6 dB per bit).

Exercise 6.3: Quantization Bits Derive

You're quantizing neural network weights from FP32 to a fixed number of bits. If weights are approximately Gaussian with σ² = 0.01, and you can tolerate MSE distortion D = 0.0001 (1% of variance), how many bits per weight minimum?

bits/weight
Show derivation
R(D) = 0.5 · log2(0.01 / 0.0001) = 0.5 · log2(100)
= 0.5 × 6.644 = 3.322 bits/weight

This says ~3.3 bits per weight is the theoretical minimum for 1% distortion. In practice, 4-bit quantization (INT4) works well for many models — it's slightly above the theoretical limit. Going to 2-bit quantization would give D = σ²/24 = 0.000625, which is 6.25% of variance — often too much.

Exercise 6.4: R-D Tradeoff Trace
For a Gaussian source, if you use R = 0 bits (zero rate), what is the minimum achievable distortion D?
Show reasoning

R(D) = 0.5·log2(σ²/D). Setting R = 0: 0 = 0.5·log2(σ²/D), so σ²/D = 1, meaning D = σ². With zero bits, the optimal strategy is to always output the mean (0), giving MSE = E[X²] = σ². You can't encode anything, so you just predict the most likely value.

Exercise 6.5: 6 dB per Bit Rule Derive

Starting from R(D) = 0.5·log2(σ²/D), show that each additional bit of rate reduces distortion by a factor of 4. If D1 at R bits, what is D2 at R+1 bits? Express D2/D1 as a fraction.

ratio D⊂2/D⊂1
Show derivation
R = 0.5·log2(σ²/D) ⇒ D = σ² · 2−2R
D1 = σ² · 2−2R,   D2 = σ² · 2−2(R+1) = σ² · 2−2R−2
D2/D1 = 2−2 = 1/4 = 0.25

Each additional bit buys a 4x reduction in distortion. In dB: 10·log10(4) = 6.02 dB. This "6 dB per bit" rule is a cornerstone of audio/video engineering and quantization theory.

Chapter 7: Mutual Information in ML

In 2017, Shwartz-Ziv and Tishby proposed the information bottleneck view of deep learning: a neural network's hidden layers first compress the input (reduce I(X;T)) and then extract relevant information about the output (increase I(T;Y)), where T is the representation at a hidden layer.

I(X;T) ≥ I(T;Y)

The data processing inequality guarantees this: processing can never create information. If X → T → Y is a Markov chain, then I(X;Y) ≥ I(T;Y). A network can only preserve or lose information about the label, never create it from nothing.

The information plane. Plot I(X;T) on the x-axis and I(T;Y) on the y-axis for each layer. Early in training, layers move right (memorize inputs). Later, layers move left (compress) while staying high on I(T;Y) (retaining label information). This is the "fitting then compressing" narrative, though it remains debated.
Exercise 7.1: Data Processing Inequality Trace
Given the Markov chain X → T1 → T2 → Y, which statement is guaranteed by the data processing inequality?
Show reasoning

The DPI says: for X → T → Y, I(X;Y) ≥ I(T;Y). Each processing step can only lose information. So I(X;Y) ≥ I(T1;Y) ≥ I(T2;Y). A network must preserve the right information — it can't synthesize new predictive information about Y.

Exercise 7.2: MI for Binary Channel Derive

A binary classifier takes input X ∈ {0,1} (uniform) through a noisy channel with P(Y=1|X=1) = 0.9 and P(Y=1|X=0) = 0.2. Compute I(X;Y).

Compute: H(Y) from marginal P(Y), then H(Y|X) from the conditional distributions, then I = H(Y) − H(Y|X).

bits
Show derivation
P(Y=1) = 0.5×0.9 + 0.5×0.2 = 0.55
P(Y=0) = 0.45
H(Y) = −0.55·log2(0.55) − 0.45·log2(0.45)
= −0.55×(−0.862) − 0.45×(−1.152)
= 0.474 + 0.518 = 0.993 bits
H(Y|X=0) = Hb(0.2) = 0.722 bits
H(Y|X=1) = Hb(0.9) = 0.469 bits
H(Y|X) = 0.5×0.722 + 0.5×0.469 = 0.596 bits
I(X;Y) = 0.993 − 0.596 = 0.397 bits

Note Hb(0.9) = Hb(0.1) = 0.469 since Hb(p) = Hb(1−p). Knowing X reduces uncertainty about Y by 0.397 bits — a substantial fraction of H(Y). The channel is informative but noisy.

Exercise 7.3: Sufficient Statistics Trace
A statistic T(X) is sufficient for Y if I(T;Y) = I(X;Y). In the context of deep learning, this means:
Show reasoning

A sufficient statistic T captures everything about Y that X contained, while potentially discarding irrelevant details of X. This is the ideal: I(X;T) can be much smaller than H(X) (compression), but I(T;Y) = I(X;Y) (no label information lost). The information bottleneck objective seeks exactly this: minimize I(X;T) subject to preserving I(T;Y).

Exercise 7.4: Information Bottleneck Bound Derive

The IB objective minimizes L = I(X;T) − β·I(T;Y). If I(X;Y) = 0.8 bits and the optimal representation achieves I(T;Y) = 0.8 bits (sufficient) with I(X;T) = 2.0 bits, compute L for β = 1.0 and β = 5.0.

L at β=1.0
Show derivation
L(β=1.0) = 2.0 − 1.0 × 0.8 = 1.2
L(β=5.0) = 2.0 − 5.0 × 0.8 = −2.0

Higher β means we value label information more relative to compression. At β = 1, the cost of storing 2 bits dominates. At β = 5, the label information is so valuable that the Lagrangian is very negative (good). The IB method sweeps β to trace the optimal tradeoff curve.

Exercise 7.5: MI Estimation Challenge Trace
Why is MI between continuous high-dimensional representations (like neural network activations) hard to compute?
Show reasoning

MI = H(T) − H(T|Y) requires knowing the distribution of T. For a 768-dimensional hidden representation, estimating this density is essentially impossible with finite samples. This is the "curse of dimensionality" for MI estimation. Modern approaches use variational bounds (MINE, InfoNCE) or binning with caveats. This difficulty is why the information plane theory remains controversial.

Chapter 8: Bits-Back Coding & VAE

The variational autoencoder (VAE) has a deep connection to information theory. Its training objective, the ELBO (Evidence Lower Bound), is literally a coding cost:

ELBO = Eq(z|x)[log p(x|z)] − DKL(q(z|x) || p(z))

The first term is the reconstruction quality (how well the decoder recovers x from z). The second term is the information cost — how many extra bits the encoder uses beyond the prior. The ELBO is a lower bound on log p(x), the log-likelihood of the data.

Bits-back argument. The KL term has a coding interpretation: it's the number of bits you "waste" by encoding z using the posterior q(z|x) instead of the prior p(z). The clever insight of bits-back coding is that you can recover these bits if you have random bits to "spend" on encoding — hence the name "bits back." This makes the KL term not a pure cost but a net cost.

For Gaussian encoder q(z|x) = N(μ, σ²) and standard normal prior p(z) = N(0,1), the KL has a closed form:

DKL(N(μ1, σ1²) || N(μ2, σ2²)) = log(σ21) + (σ1² + (μ1−μ2)²) / (2σ2²) − 0.5
Exercise 8.1: KL to Standard Normal Derive

The VAE encoder outputs μ = 0.5, σ = 0.8 for one latent dimension. The prior is N(0,1). Compute the KL divergence in nats.

For q = N(μ, σ²) vs p = N(0,1): KL = −0.5·(1 + log(σ²) − μ² − σ²). Use ln(0.64) ≈ −0.446.

nats
Show derivation
KL = −0.5 × (1 + ln(0.64) − 0.25 − 0.64)
= −0.5 × (1 − 0.446 − 0.25 − 0.64)
= −0.5 × (−0.336) = 0.168

The encoder is moderately close to the prior — μ is only 0.5 away from 0, and σ = 0.8 is fairly close to 1. The KL cost per dimension is small. In a VAE with hundreds of latent dimensions, these small per-dimension costs add up.

Exercise 8.2: KL Between Two Gaussians Derive

Compute DKL(N(1, 0.5²) || N(0, 1²)). Use the full formula: KL = ln(σ21) + (σ1² + (μ1−μ2)²)/(2σ2²) − 0.5.

nats
Show derivation
μ1=1, σ1=0.5, μ2=0, σ2=1
KL = ln(1/0.5) + (0.25 + 1)/(2×1) − 0.5
= ln(2) + 1.25/2 − 0.5
= 0.693 + 0.625 − 0.5 = 0.818 nats
Exercise 8.3: ELBO Decomposition Trace
A VAE has reconstruction loss = 50.0 nats and KL loss = 3.2 nats. What is the ELBO, and what does maximizing it mean?
Show reasoning

ELBO = E[log p(x|z)] − KL. The reconstruction term is a negative log-likelihood, typically reported as a positive loss. So ELBO = −50.0 − 3.2 = −53.2. Maximizing the ELBO (making it less negative) means: (1) improve reconstruction (decrease the 50.0), and (2) keep the posterior close to the prior (decrease the 3.2). In practice, we minimize −ELBO = 53.2.

Exercise 8.4: KL Annealing Trace
In "posterior collapse," the VAE ignores z entirely (q(z|x) ≈ p(z) for all x). What are the KL and reconstruction losses in this regime?
Show reasoning

Posterior collapse: q(z|x) = p(z) regardless of x, so KL = 0 — the "information cost" is zero because the encoder transmits zero information. But this means the decoder gets no useful signal from z and must model p(x) entirely on its own, giving poor reconstruction. The fix: KL annealing — start with low KL weight and gradually increase it, letting the encoder learn useful representations before being penalized for diverging from the prior.

Exercise 8.5: Multi-Dimensional KL Derive

A VAE has latent dimension d = 3. The encoder outputs μ = (0.5, −0.3, 0.8) and σ = (0.7, 1.2, 0.4). The prior is N(0, I). Compute the total KL (sum over dimensions).

For each dimension j: KLj = −0.5(1 + ln(σj²) − μj² − σj²). Sum them.

nats
Show derivation
Dim 0: −0.5(1 + ln(0.49) − 0.25 − 0.49) = −0.5(1 − 0.713 − 0.25 − 0.49) = −0.5(−0.453) = 0.227
Dim 1: −0.5(1 + ln(1.44) − 0.09 − 1.44) = −0.5(1 + 0.365 − 0.09 − 1.44) = −0.5(−0.165) = 0.083
Dim 2: −0.5(1 + ln(0.16) − 0.64 − 0.16) = −0.5(1 − 1.833 − 0.64 − 0.16) = −0.5(−1.633) = 0.817

Total = 0.227 + 0.083 + 0.816 = 1.126 nats. Dimension 2 dominates: it has both a shifted mean (μ=0.8) and a narrow variance (σ=0.4), making it most different from the N(0,1) prior. Dimension 1 contributes least because its wider variance (σ=1.2) partially offsets the small mean shift.

Exercise 8.6: Find the Bug Debug

This VAE KL computation produces negative values. Click the buggy line.

function vaeKL(mu, logvar) {
  // mu, logvar: arrays of length d
  let kl = 0;
  for (let i = 0; i < mu.length; i++) {
    const sigma2 = Math.exp(logvar[i]);
    kl += 0.5 * (1 + logvar[i] - mu[i]*mu[i] - sigma2);
  }
  return kl;
}
Show explanation

Line 6 computes the negative ELBO contribution (which is negative, not the KL itself). The KL divergence is KL = 0.5·(1 + log(σ²) − μ² − σ²). The line is missing the negative sign. It should be: kl += -0.5 * (1 + logvar[i] - mu[i]*mu[i] - sigma2);. Without the negation, the sum is negative (since σ² + μ² > 1 + log(σ²) for typical values), making KL appear negative, which violates KL ≥ 0.

Chapter 9: Capstone: Compression Analysis

You've built the full toolkit. Now let's analyze a real-world compression scenario end-to-end: from character frequencies to Huffman bounds to LLM-as-compressor. This is the kind of back-of-envelope analysis that separates someone who understands information theory from someone who just memorized formulas.

The capstone challenge. A text file contains 10,000 characters with the following frequency counts: E:1270, T:906, A:817, O:751, I:697, N:675, S:633, H:609, R:599, space:2043, other:1000 (all remaining characters lumped together). You'll compute entropy, Huffman bounds, and connect it all to LLM compression.
Exercise 9.1: Character Entropy Derive

Convert the frequency counts to probabilities (divide by 10,000) and compute H(X). The probabilities are: space=0.2043, E=0.127, T=0.0906, A=0.0817, O=0.0751, I=0.0697, N=0.0675, S=0.0633, H=0.0609, R=0.0599, other=0.1.

Use the formula H = −∑ pi·log2(pi). This has 11 terms.

bits/char
Show derivation
H = −[0.2043·log2(0.2043) + 0.127·log2(0.127) + 0.0906·log2(0.0906)
  + 0.0817·log2(0.0817) + 0.0751·log2(0.0751) + 0.0697·log2(0.0697)
  + 0.0675·log2(0.0675) + 0.0633·log2(0.0633) + 0.0609·log2(0.0609)
  + 0.0599·log2(0.0599) + 0.1·log2(0.1)]
≈ 0.484 + 0.383 + 0.316 + 0.296 + 0.280 + 0.267 + 0.261 + 0.250 + 0.244 + 0.241 + 0.332
= 3.354 bits/char (approximately)

This is the zero-order (unigram) entropy — treating each character as independent. Real English text has much lower entropy (~1.0–1.5 bits/char) because of sequential dependencies. The gap between unigram entropy and true entropy is what n-gram and neural language models exploit.

Exercise 9.2: Theoretical Minimum File Size Derive

Given the unigram entropy from 9.1 (~3.3 bits/char), what is the theoretical minimum file size for the 10,000-character file? Express in bytes (8 bits = 1 byte). Compare to the naive ASCII size (8 bits/char).

bytes (minimum)
Show derivation
Minimum bits = 10,000 × 3.3 = 33,000 bits
Minimum bytes = 33,000 / 8 = 4,125 bytes
ASCII size = 10,000 × 8 / 8 = 10,000 bytes
Compression ratio = 10,000 / 4,125 = 2.42:1

We can compress to ~41% of the original size using just unigram statistics. Real compressors (gzip, zstd) do even better because they exploit sequential patterns. The Shannon limit with full context is ~1.2 bits/char × 10,000 = 1,500 bytes — a 6.67:1 ratio.

Exercise 9.3: Huffman vs Arithmetic Trace
Huffman coding is limited to integer bit lengths per symbol. Arithmetic coding can achieve fractional bits per symbol. For our 11-symbol alphabet with H ≈ 3.3 bits/symbol, which statement is correct?
Show reasoning

Huffman: H ≤ L < H + 1 (per symbol). Arithmetic coding can approach H to within O(1/n) total bits for a sequence of length n. For long sequences, arithmetic coding is near-optimal. This is why modern compressors (zlib, zstd, ANS) use arithmetic-coding variants rather than Huffman.

Exercise 9.4: LLM as Compressor Derive

A language model achieves perplexity 15.3 on a test set. Convert this to bits per character (assuming character-level tokenization). Perplexity = 2H where H is the cross-entropy in bits.

bits/char
Show derivation
PPL = 2H ⇒ H = log2(PPL) = log2(15.3)
= 3.935 bits/char

This model is worse than unigram Huffman (3.3 bits/char)! It must be a bad model or the test set is unusual. GPT-4-class models achieve ~0.7–0.9 bits per character on standard English, approaching the Shannon limit. A perplexity of 15.3 at the character level is quite high.

Exercise 9.5: Subword Perplexity to BPC Derive

Most LLMs use subword tokenization. GPT-2 achieves perplexity 29.4 on WikiText-103 with an average of 1.3 tokens per word and 4.5 characters per word. Convert to bits per character.

bits/token = log2(PPL). tokens/char = 1.3/4.5. bits/char = bits/token × tokens/char.

bits/char
Show derivation
bits/token = log2(29.4) = 4.877
tokens/char = 1.3 / 4.5 = 0.2889
bits/char = 4.877 × 0.2889 = 1.409

~1.4 bits per character is in the range of Shannon's estimate for English (1.0–1.5 bits/char). This means GPT-2 captures most of the statistical structure of English text. Better models (GPT-4) push this even lower, suggesting the true entropy of English may be below 1.0 bits/char.

Exercise 9.6: Compression Comparison Derive

For our 10,000-character file, compute the compressed size (in bytes) for each method: (a) Unigram Huffman at 3.3 bits/char, (b) gzip-like at ~2.0 bits/char, (c) LLM-based at 1.4 bits/char. Express all as bytes.

bytes (LLM method)
Show derivation
(a) Huffman: 10,000 × 3.3 / 8 = 4,125 bytes
(b) gzip: 10,000 × 2.0 / 8 = 2,500 bytes
(c) LLM: 10,000 × 1.4 / 8 = 1,750 bytes

From 10 KB original to 1.75 KB — a 5.7:1 compression ratio. The progression shows how each level of modeling captures more structure: unigram frequencies → local patterns (gzip) → full linguistic context (LLM). This is why "language modeling is compression" is not just a metaphor — a better language model is literally a better compressor.

Exercise 9.7: Arrange the Compression Stack Design

Order these compression approaches from worst to best compression ratio on English text.

?
?
?
?
?
ASCII (8 bits/char) Huffman (~3.3 bits) gzip/LZ77 (~2.0 bits) LLM-based (~1.0 bits) Shannon limit (~0.8 bits)
Show correct order

Worst → Best compression: ASCII (no compression) → Huffman (unigram model) → gzip (local pattern matching) → LLM (full context model) → Shannon limit (theoretical floor). Each step exploits more structure in the data. The Shannon limit is unreachable but approachable.

Exercise 9.8: Implement bitsPerChar() Build

Write a function that takes a perplexity value and characters-per-token ratio, and returns bits per character.

bits/token = log2(PPL). bits/char = bits/token / chars/token.
Show solution
javascript
function bitsPerChar(perplexity, charsPerToken) {
  const bitsPerToken = Math.log2(perplexity);
  return bitsPerToken / charsPerToken;
}