Information Theory Workbook — Engineermaxxing

Chapter 0: Entropy

You flip a fair coin. Before it lands, how "uncertain" are you about the outcome? Intuitively: maximally uncertain between two options. Shannon formalized this intuition into a single number called entropy — measured in bits.

The entropy of a discrete random variable X with probabilities p₁, p₂, …, p_n is:

H(X) = −∑_i p_i · log₂(p_i)

Why log₂? Because one bit is the amount of information needed to distinguish between two equally likely outcomes. Why the negative sign? Because log₂(p) is negative when p < 1, and we want entropy to be positive.

Entropy = expected surprise. The quantity −log₂(p_i) is called the surprisal or self-information of outcome i. A rare event (small p) has high surprisal. Entropy is just the average surprisal, weighted by probability. High entropy means "on average, outcomes are surprising" — i.e., the distribution is spread out.

Fair coin: p = (0.5, 0.5). H = −0.5·log₂(0.5) − 0.5·log₂(0.5) = −0.5·(−1) − 0.5·(−1) = 1 bit.

Biased coin: p = (0.9, 0.1). H = −0.9·log₂(0.9) − 0.1·log₂(0.1) = −0.9·(−0.152) − 0.1·(−3.322) = 0.137 + 0.332 = 0.469 bits. Less uncertainty because you can almost predict the outcome.

English text has about 1.0–1.5 bits per character. Shannon estimated this in 1951 by having people guess the next letter — the fact that English is so predictable (low entropy) is why compression works so well.

Exercise 0.1: Three-Sided Die Derive

A fair 3-sided die has outcomes {1, 2, 3} each with probability 1/3. Compute H(X) in bits.

H = −3 × (1/3)·log₂(1/3). Remember: log₂(3) ≈ 1.595.

bits

Show derivation

H = −3 × (1/3) × log₂(1/3)

= −3 × (1/3) × (−1.595)

= 3 × (1/3) × 1.595 = 1.595 bits

For a fair n-sided die, H = log₂(n). This is always the maximum entropy for n outcomes — a uniform distribution maximizes uncertainty.

Exercise 0.2: Weighted Die Derive

A 4-sided die has probabilities P = (0.5, 0.25, 0.125, 0.125). Compute H(X).

bits

Show derivation

H = −[0.5·log₂(0.5) + 0.25·log₂(0.25) + 0.125·log₂(0.125) + 0.125·log₂(0.125)]

= −[0.5·(−1) + 0.25·(−2) + 0.125·(−3) + 0.125·(−3)]

= −[−0.5 − 0.5 − 0.375 − 0.375]

= 1.75 bits

Maximum entropy for 4 outcomes is log₂(4) = 2 bits. This distribution has 1.75 bits — less than maximum because it's non-uniform.

Exercise 0.3: Maximum Entropy Trace

Which distribution over 8 outcomes has the highest entropy?

P = (0.9, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.04) P = (0.5, 0.5, 0, 0, 0, 0, 0, 0) P = (1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8) — uniform P = (1, 0, 0, 0, 0, 0, 0, 0)

Show reasoning

Entropy is maximized by the uniform distribution. H_uniform = log₂(8) = 3 bits. The deterministic case (1,0,…,0) has H = 0 — no uncertainty at all. The concentrated case (0.9, …) has H ≈ 0.922 bits. The binary case (0.5, 0.5, 0, …) has H = 1 bit.

Exercise 0.4: Binary Entropy Function Derive

The binary entropy function H_b(p) = −p·log₂(p) − (1−p)·log₂(1−p) describes the entropy of a coin with bias p. Compute H_b(0.2).

bits

Show derivation

H_b(0.2) = −0.2·log₂(0.2) − 0.8·log₂(0.8)

= −0.2·(−2.322) − 0.8·(−0.322)

= 0.464 + 0.259 = 0.722 bits

Exercise 0.5: Implement entropy() Build

Write a function that computes the Shannon entropy in bits given an array of probabilities. Handle the convention that 0·log(0) = 0.

Use Math.log2(). Skip any p === 0.

Show solution

javascript
function entropy(probs) {
  let h = 0;
  for (const p of probs) {
    if (p > 0) h -= p * Math.log2(p);
  }
  return h;
}

Exercise 0.6: Find the Bug Debug

This entropy function returns negative values. Click the buggy line.

function entropy(probs) {
  let h = 0;
  for (const p of probs) {
    if (p > 0) h += p * Math.log2(p);
  }
  return h;
}

Show explanation

Line 4 uses += instead of -=. Since log₂(p) is negative for 0 < p < 1, summing p * log2(p) produces a negative number. The definition of entropy has a negative sign: H = −∑ p·log(p). Either negate the sum at the end or accumulate with -=.

Chapter 1: Joint & Conditional Entropy

So far we measured uncertainty in a single variable. But in ML, we almost always care about relationships between variables — input and output, features and labels, encoder and decoder. We need tools for measuring shared and conditional uncertainty.

Joint entropy H(X,Y) measures the total uncertainty in the pair (X,Y):

H(X,Y) = −∑_x,y p(x,y) · log₂ p(x,y)

Conditional entropy H(Y|X) measures the remaining uncertainty in Y once you know X:

H(Y|X) = H(X,Y) − H(X)

Chain rule of entropy. H(X,Y) = H(X) + H(Y|X). Total uncertainty = uncertainty in X + leftover uncertainty in Y after seeing X. This is the information-theoretic analog of the probability chain rule p(x,y) = p(x)·p(y|x).

Mutual information I(X;Y) measures how much knowing X reduces uncertainty about Y:

I(X;Y) = H(Y) − H(Y|X) = H(X) − H(X|Y) = H(X) + H(Y) − H(X,Y)

A critical inequality: H(Y|X) ≤ H(Y). Conditioning never increases entropy. In the worst case (X and Y are independent), H(Y|X) = H(Y) and mutual information is zero.

Exercise 1.1: Joint Entropy from Table Derive

Given this joint distribution over (X, Y) where X ∈ {0,1} and Y ∈ {0,1}:

	Y=0	Y=1
X=0	0.4	0.1
X=1	0.1	0.4

Compute H(X,Y). Apply the entropy formula to all four joint probabilities.

bits

Show derivation

H(X,Y) = −[2×0.4·log₂(0.4) + 2×0.1·log₂(0.1)]

= −[2×0.4×(−1.322) + 2×0.1×(−3.322)]

= −[−1.059 − 0.664] = 1.722 bits

Exercise 1.2: Conditional Entropy Derive

Using the same table from 1.1: the marginal P(X) = (0.5, 0.5) so H(X) = 1 bit. Compute H(Y|X) using the chain rule.

bits

Show derivation

H(Y|X) = H(X,Y) − H(X) = 1.722 − 1.0 = 0.722 bits

Knowing X reduces our uncertainty about Y from H(Y) = 1.0 bit to 0.722 bits. The mutual information I(X;Y) = H(Y) − H(Y|X) = 1.0 − 0.722 = 0.278 bits.

Exercise 1.3: Mutual Information Derive

For the same table: compute I(X;Y) using the formula I(X;Y) = H(X) + H(Y) − H(X,Y). Note that H(Y) = 1 bit (marginal is also 0.5, 0.5).

bits

Show derivation

I(X;Y) = H(X) + H(Y) − H(X,Y) = 1 + 1 − 1.722 = 0.278 bits

This matches I(X;Y) = H(Y) − H(Y|X) = 1.0 − 0.722 = 0.278. Mutual information is symmetric: knowing X tells you as much about Y as knowing Y tells you about X.

Exercise 1.4: Independent Variables Trace

If X and Y are independent (p(x,y) = p(x)·p(y) for all x,y), what is I(X;Y)?

I(X;Y) = H(X) — maximum information I(X;Y) = 0 — knowing X tells you nothing about Y I(X;Y) = H(X) + H(Y) — total entropy I(X;Y) = −1 — mutual information can be negative for independent variables

Show reasoning

If X and Y are independent, H(X,Y) = H(X) + H(Y). So I(X;Y) = H(X) + H(Y) − H(X,Y) = 0. Equivalently, H(Y|X) = H(Y) — learning X doesn't reduce uncertainty about Y at all. Mutual information is always ≥ 0.

Exercise 1.5: Joint Entropy of Independent Dice Derive

You roll two independent fair 6-sided dice. What is H(X,Y)?

For independent variables: H(X,Y) = H(X) + H(Y).

bits

Show derivation

H(X) = log₂(6) = 2.595 bits

H(X,Y) = H(X) + H(Y) = 2 × 2.595 = 5.170 bits

Equivalently: the pair (X,Y) has 36 equally likely outcomes, so H = log₂(36) = log₂(6²) = 2·log₂(6) = 5.170 bits.

Exercise 1.6: Conditioning Reduces Entropy Trace

A weather model predicts rain (R) and temperature (T). You observe that H(T) = 3.2 bits, H(T|R) = 2.8 bits, and H(R) = 0.9 bits. What is the mutual information I(R;T)?

0.1 bits 0.4 bits 0.9 bits 3.2 bits

Show reasoning

I(R;T) = H(T) − H(T|R) = 3.2 − 2.8 = 0.4 bits

Knowing whether it rains reduces temperature uncertainty by 0.4 bits. This is intuitive — rainy days tend to be cooler, so rain gives you partial information about temperature.

Chapter 2: KL Divergence

Entropy measures uncertainty in a single distribution. But in ML, we constantly compare distributions: How different is my model's output from the true distribution? How much did fine-tuning change the model? KL divergence quantifies this distance.

D_KL(P || Q) = ∑_i p_i · log₂(p_i / q_i)

Read it as: "the KL divergence from Q to P" or "the extra bits needed to encode samples from P using a code optimized for Q." The order matters — P is the "true" distribution, Q is the "approximation."

KL is NOT symmetric. D_KL(P||Q) ≠ D_KL(Q||P) in general. This is why KL is not a true "distance." It measures the penalty for using Q when the truth is P, which is a different question from using P when the truth is Q.

Gibbs' inequality: D_KL(P||Q) ≥ 0, with equality if and only if P = Q. This is one of the most fundamental inequalities in information theory, and it's why minimizing KL divergence is equivalent to finding the closest approximation.

Exercise 2.1: KL for Fair vs. Biased Derive

Compute D_KL(P||Q) where P = (0.5, 0.5) and Q = (0.9, 0.1).

D_KL = 0.5·log₂(0.5/0.9) + 0.5·log₂(0.5/0.1).

bits

Show derivation

D_KL(P||Q) = 0.5 · log₂(0.5/0.9) + 0.5 · log₂(0.5/0.1)

= 0.5 · log₂(0.556) + 0.5 · log₂(5.0)

= 0.5 · (−0.848) + 0.5 · 2.322

= −0.424 + 1.161 = 0.737 bits

The fair distribution P "wastes" 0.737 bits when encoded using a code optimized for Q. This is asymmetric — compare with D_KL(Q||P) in the next exercise.

Exercise 2.2: KL in the Other Direction Derive

Now compute D_KL(Q||P) where Q = (0.9, 0.1) and P = (0.5, 0.5). Compare with the previous exercise to see asymmetry.

bits

Show derivation

D_KL(Q||P) = 0.9 · log₂(0.9/0.5) + 0.1 · log₂(0.1/0.5)

= 0.9 · log₂(1.8) + 0.1 · log₂(0.2)

= 0.9 · 0.848 + 0.1 · (−2.322)

= 0.763 − 0.232 = 0.531 bits

D_KL(P||Q) ≠ D_KL(Q||P) in general, though in this specific case with P uniform they happen to relate simply. The asymmetry becomes dramatic when Q has a zero probability where P is nonzero — that makes D_KL(P||Q) = ∞.

Exercise 2.3: KL Between Close Distributions Derive

Compute D_KL(P||Q) where P = (0.4, 0.3, 0.3) and Q = (0.35, 0.35, 0.3).

Hint: log₂(0.4/0.35) = log₂(8/7) ≈ 0.193. log₂(0.3/0.35) = log₂(6/7) ≈ −0.222. log₂(0.3/0.3) = 0.

bits

Show derivation

D_KL = 0.4·log₂(8/7) + 0.3·log₂(6/7) + 0.3·log₂(1)

= 0.4 × 0.193 + 0.3 × (−0.222) + 0

= 0.0772 − 0.0666 = 0.0106 bits

Very close distributions have very small KL. This is why KL works well as a loss — as the model distribution approaches the target, the loss smoothly approaches zero.

Exercise 2.4: When KL Explodes Trace

P = (0.5, 0.5) and Q = (1.0, 0.0). What is D_KL(P||Q)?

1.0 bits 0 bits — Q is a special case of P ∞ — Q assigns zero probability to an event P considers possible Undefined — KL can't compare distributions with different supports

Show reasoning

D_KL(P||Q) requires computing log(p/q). When p > 0 but q = 0, we get log(p/0) = ∞. This is infinite divergence — a code designed for Q literally cannot encode an event that Q says is impossible but P says can happen. This is why in ML, we use label smoothing or add small ε to predicted probabilities.

Exercise 2.5: KL Decomposition Trace

The fundamental decomposition is: H(P,Q) = H(P) + D_KL(P||Q), where H(P,Q) is cross-entropy. If H(P) = 1.5 bits and D_KL(P||Q) = 0.3 bits, what is the cross-entropy?

1.8 bits 1.2 bits 0.3 bits 1.5 bits

Show reasoning

CE(P,Q) = H(P) + D_KL(P||Q) = 1.5 + 0.3 = 1.8 bits

Cross-entropy is always at least as large as the true entropy H(P). The gap is exactly D_KL — the "cost of being wrong." This is why minimizing cross-entropy is equivalent to minimizing KL divergence (since H(P) is fixed).

Exercise 2.6: Implement klDivergence() Build

Write a function that computes D_KL(P||Q) in bits. If q_i = 0 where p_i > 0, return Infinity.

Use Math.log2(). Check for q[i] === 0 when p[i] > 0.

Show solution

javascript
function klDivergence(p, q) {
  let kl = 0;
  for (let i = 0; i < p.length; i++) {
    if (p[i] === 0) continue;
    if (q[i] === 0) return Infinity;
    kl += p[i] * Math.log2(p[i] / q[i]);
  }
  return kl;
}

Chapter 3: Cross-Entropy

Here's where information theory meets ML directly. Every time you train a classifier, you're minimizing cross-entropy loss. It's not an arbitrary choice — it's the information-theoretically optimal objective.

H(P, Q) = −∑_i p_i · log₂(q_i)

Where P is the true distribution (labels) and Q is the model's predicted distribution. In classification, P is a one-hot vector: p_c = 1 for the true class c, and 0 elsewhere. This simplifies beautifully:

H(P, Q) = −log₂(q_c) = −log(q_c) / log(2)

Cross-entropy IS the ML loss. The standard classification loss -log(predicted_probability_of_true_class) is literally the cross-entropy between the one-hot label and the softmax output. Minimizing cross-entropy = minimizing KL divergence from the true label distribution, because H(P,Q) = H(P) + D_KL(P||Q) and H(P) is constant.

In practice, ML frameworks use natural log (base e) instead of log₂, so the loss is in nats rather than bits. The relationship: 1 nat = 1/ln(2) ≈ 1.443 bits.

Exercise 3.1: Classification Loss Derive

A model predicts softmax probabilities [0.7, 0.2, 0.1] for a 3-class problem. The true class is 0 (first class). Compute the cross-entropy loss in nats.

For one-hot labels: CE = −ln(q_{true_class}). Use ln(0.7) ≈ −0.357.

nats

Show derivation

CE = −ln(0.7) = 0.357 nats

In bits: 0.357 × 1.443 = 0.515 bits. The loss only depends on the predicted probability of the true class. If the model predicted 0.99 for the true class, the loss would be −ln(0.99) = 0.01 nats — nearly perfect.

Exercise 3.2: Worst-Case Loss Derive

The model predicts [0.01, 0.01, 0.98] but the true class is 0. Compute the cross-entropy loss in nats.

nats

Show derivation

CE = −ln(0.01) = ln(100) = 4.605 nats

The model was very confident in the wrong class. The loss is huge. If the model had predicted exactly 0 for the true class, the loss would be infinity — this is why we never let softmax outputs reach exact zero.

Exercise 3.3: Full Cross-Entropy (Soft Labels) Derive

With label smoothing, the target distribution is P = (0.9, 0.05, 0.05) instead of one-hot. Model predicts Q = (0.7, 0.2, 0.1). Compute CE(P,Q) in nats.

Use the full formula: −∑ p_i·ln(q_i). ln(0.7) ≈ −0.357, ln(0.2) ≈ −1.609, ln(0.1) ≈ −2.303.

nats

Show derivation

CE = −[0.9·ln(0.7) + 0.05·ln(0.2) + 0.05·ln(0.1)]

= −[0.9×(−0.357) + 0.05×(−1.609) + 0.05×(−2.303)]

= −[−0.321 − 0.080 − 0.115]

= 0.517 nats

With label smoothing, the loss is higher (0.517 vs 0.357) even with the same predictions, because we're penalizing the model for not spreading some probability to the non-true classes. This acts as regularization.

Exercise 3.4: CE vs KL Relationship Trace

A one-hot distribution P = (1, 0, 0, 0) has entropy H(P) = 0. If CE(P, Q) = 2.3 nats, what is D_KL(P||Q)?

0 nats — one-hot labels always have KL = 0 2.3 nats — when H(P) = 0, CE equals KL exactly 4.6 nats — KL is always double the CE Cannot determine without knowing Q

Show reasoning

CE(P,Q) = H(P) + D_KL(P||Q)

2.3 = 0 + D_KL(P||Q)

D_KL(P||Q) = 2.3 nats

For one-hot labels, cross-entropy and KL divergence are identical. This is why minimizing cross-entropy loss is exactly the same as minimizing KL divergence from the labels. The two objectives are interchangeable.

Exercise 3.5: Implement crossEntropy() Build

Write a function that computes cross-entropy in nats. Takes true distribution p and predicted distribution q.

CE = -sum(p[i] * ln(q[i])). Skip terms where p[i] = 0.

Show solution

javascript
function crossEntropy(p, q) {
  let ce = 0;
  for (let i = 0; i < p.length; i++) {
    if (p[i] > 0) ce -= p[i] * Math.log(q[i]);
  }
  return ce;
}

Exercise 3.6: Nats to Bits Derive

A language model achieves a cross-entropy loss of 3.2 nats per token. Convert this to bits per token. (1 bit = ln(2) ≈ 0.693 nats.)

bits/token

Show derivation

bits = nats / ln(2) = 3.2 / 0.693 = 4.617 bits/token

Equivalently: bits = nats × log₂(e) = 3.2 × 1.4427 = 4.617. Perplexity = 2^bits = 2^4.617 ≈ 24.5, meaning the model is "as confused as choosing uniformly among 24.5 tokens."

Chapter 4: Source Coding

Shannon's source coding theorem (1948) tells us the fundamental limit of lossless compression: you cannot compress data below its entropy rate, on average. Any lossless code must use at least H(X) bits per symbol on average.

Average code length L ≥ H(X)

Huffman coding achieves this bound to within 1 bit per symbol. The algorithm: assign shorter codes to more probable symbols, longer codes to less probable ones. The construction is bottom-up: repeatedly merge the two lowest-probability symbols.

Why Huffman beats fixed-length. With 5 symbols, a fixed-length code needs ⌈log₂(5)⌉ = 3 bits per symbol. But if one symbol appears 40% of the time, Huffman can give it a 1-bit code and push rare symbols to 3-4 bits. The average drops below 3.

Exercise 4.1: Entropy Lower Bound Derive

Five symbols A-E have probabilities P = (0.4, 0.2, 0.2, 0.1, 0.1). Compute the entropy H(X) — this is the theoretical minimum average code length.

bits/symbol

Show derivation

H = −[0.4·log₂(0.4) + 2×0.2·log₂(0.2) + 2×0.1·log₂(0.1)]

= −[0.4×(−1.322) + 2×0.2×(−2.322) + 2×0.1×(−3.322)]

= 0.529 + 0.929 + 0.664 = 2.122 bits/symbol

Exercise 4.2: Build the Huffman Tree Trace

For P = (A:0.4, B:0.2, C:0.2, D:0.1, E:0.1), the Huffman algorithm first merges the two smallest. What are the correct Huffman code lengths for each symbol?

A:1, B:2, C:2, D:3, E:3 — all codes are different lengths A:1, B:3, C:3, D:3, E:3 — A gets 1 bit, rest get 3 A:2, B:2, C:2, D:2, E:2 — all 2 bits (same as fixed-length) A:2, B:2, C:3, D:3, E:3 — medium symbols get 2 bits

Show Huffman construction

Step 1: Merge D(0.1) + E(0.1) = DE(0.2). Queue: A(0.4), B(0.2), C(0.2), DE(0.2).

Step 2: Merge B(0.2) + C(0.2) = BC(0.4). Queue: A(0.4), BC(0.4), DE(0.2).

Wait — we should merge the two smallest: actually after step 1, we pick the two smallest from {A:0.4, B:0.2, C:0.2, DE:0.2}. That's any two of {B, C, DE}. Merge C(0.2) + DE(0.2) = CDE(0.4). Queue: A(0.4), B(0.2), CDE(0.4).

Step 3: Merge B(0.2) + A(0.4) or B(0.2) + CDE(0.4). Merge B(0.2) + CDE(0.4) = BCDE(0.6). Queue: A(0.4), BCDE(0.6).

Step 4: Merge A(0.4) + BCDE(0.6) = root.

Result: A=1 bit (code: 0), B=3 bits (code: 100), C=3 bits (code: 110), D=3 bits (code: 1110), E=3 bits (code: 1111). Actually this depends on tie-breaking. One valid Huffman: A:1, B:3, C:3, D:3, E:3.

Exercise 4.3: Average Code Length Derive

Using the Huffman codes A:1 bit, B:3 bits, C:3 bits, D:3 bits, E:3 bits with probabilities (0.4, 0.2, 0.2, 0.1, 0.1), compute the average code length L.

bits/symbol

Show derivation

L = 0.4×1 + 0.2×3 + 0.2×3 + 0.1×3 + 0.1×3

= 0.4 + 0.6 + 0.6 + 0.3 + 0.3 = 2.2 bits/symbol

The entropy was 2.122 bits. The Huffman code achieves 2.2 bits — only 0.078 bits above the theoretical minimum. The efficiency is H/L = 2.122/2.2 = 96.5%. A better Huffman tree (codes A:1, B:2, C:3, D:4, E:4 or similar) could get closer.

Exercise 4.4: Optimal Huffman Derive

An alternative valid Huffman tree gives codes A:1 bit, B:2 bits, C:3 bits, D:4 bits, E:4 bits. Compute the average code length with probabilities (0.4, 0.2, 0.2, 0.1, 0.1).

bits/symbol

Show derivation

L = 0.4×1 + 0.2×2 + 0.2×3 + 0.1×4 + 0.1×4

= 0.4 + 0.4 + 0.6 + 0.4 + 0.4 = 2.2 bits/symbol

Same average! Different valid Huffman trees can produce different code assignments, but the average code length is always the same. This Huffman average of 2.2 is optimal — no prefix-free code can do better for these probabilities.

Exercise 4.5: Compression Ratio Derive

A file contains 100,000 characters from alphabet {A,B,C,D,E} with frequencies matching probabilities (0.4, 0.2, 0.2, 0.1, 0.1). Fixed-length encoding uses 3 bits/char. What's the compression ratio using Huffman (2.2 bits/char)?

Compression ratio = uncompressed size / compressed size.

:1 ratio

Show derivation

Uncompressed = 100,000 × 3 = 300,000 bits

Compressed = 100,000 × 2.2 = 220,000 bits

Ratio = 300,000 / 220,000 = 1.364:1

We saved 80,000 bits (10 KB). The theoretical limit would be 100,000 × 2.122 = 212,200 bits, giving ratio 1.414:1. Huffman gets us 96.5% of the way there.

Exercise 4.6: Arrange the Compression Pipeline Design

Put these steps of Huffman compression in order.

→

Count symbol frequencies Build Huffman tree Assign codes from tree Replace symbols with codes Prepend codebook header

Show correct order

1. Count frequencies → 2. Build tree (merge lowest-probability nodes) → 3. Assign codes (left=0, right=1 from root) → 4. Encode (replace each symbol with its code) → 5. Header (prepend the codebook so the decoder can reconstruct the tree).

Chapter 5: Channel Capacity

Source coding tells you how much you can compress. Channel coding tells you how fast you can reliably communicate over a noisy channel. Shannon's channel coding theorem (1948): for any rate below the channel capacity C, there exist codes that achieve arbitrarily low error probability.

The binary symmetric channel (BSC) flips each bit independently with probability p:

C_BSC = 1 − H_b(p) = 1 + p·log₂(p) + (1−p)·log₂(1−p)

When p = 0 (noiseless), C = 1 bit per channel use. When p = 0.5 (pure noise), C = 0 — the channel is useless.

The AWGN channel (additive white Gaussian noise) models continuous signals with Gaussian noise. Its capacity is the famous Shannon-Hartley theorem:

C = W · log₂(1 + SNR) bits/second

where W is bandwidth in Hz and SNR is the signal-to-noise ratio (linear, not dB).

Exercise 5.1: BSC Capacity (p=0.1) Derive

Compute the capacity of a BSC with flip probability p = 0.1. First compute H_b(0.1), then subtract from 1.

bits/use

Show derivation

H_b(0.1) = −0.1·log₂(0.1) − 0.9·log₂(0.9)

= −0.1×(−3.322) − 0.9×(−0.152)

= 0.332 + 0.137 = 0.469 bits

C = 1 − 0.469 = 0.531 bits/channel use

Even with 10% bit flips, we can still reliably communicate at 0.531 bits per channel use. The trick is using error-correcting codes that add redundancy — you send more bits than needed, and the decoder corrects the errors.

Exercise 5.2: AWGN Channel Derive

An AWGN channel has bandwidth W = 1 MHz and SNR = 10 (linear). Compute the capacity in Mbits/second.

Mbits/s

Show derivation

C = 1,000,000 × log₂(1 + 10)

= 1,000,000 × log₂(11)

= 1,000,000 × 3.459 = 3.459 Mbits/s

Doubling the bandwidth doubles the capacity (linear). But doubling the SNR only adds log₂(2) ≈ 1 bit/s/Hz in the high-SNR regime. Bandwidth is precious; power is cheap. This drives engineering decisions in telecommunications.

Exercise 5.3: SNR in dB Derive

A Wi-Fi channel has SNR = 20 dB and bandwidth 20 MHz. Convert SNR to linear (SNR_linear = 10^SNR_dB/10) and compute capacity.

Mbits/s

Show derivation

SNR_linear = 10^20/10 = 10² = 100

C = 20 × 10⁶ × log₂(101)

= 20 × 10⁶ × 6.659

= 133.15 Mbits/s

Real Wi-Fi 5 (802.11ac) on a 20 MHz channel achieves about 87 Mbits/s — roughly 65% of Shannon capacity. The gap is due to practical coding, guard intervals, and protocol overhead.

Exercise 5.4: Useless Channel Trace

A BSC with p = 0.5 has capacity C = 0. Why is a coin-flip channel useless, and what about p = 1.0?

p=0.5 is useless; p=1.0 is also useless (all bits are flipped) p=0.5 is useless; p=1.0 has C=0.5 p=0.5 is useless (output independent of input); p=1.0 has C=1 (just flip the output) Both are useless — any p > 0 makes reliable communication impossible

Show reasoning

At p = 0.5, the output is a random coin flip regardless of input — zero information passes through. But at p = 1.0, every bit is flipped deterministically. The receiver just inverts every bit and recovers the original perfectly. C = 1 − H_b(1.0) = 1 − 0 = 1 bit. A channel that always flips is just as good as one that never flips!

Exercise 5.5: Bandwidth vs Power Tradeoff Derive

You need 10 Mbits/s capacity. Option A: W = 5 MHz, find minimum SNR. Option B: W = 2 MHz, find minimum SNR. Compute both.

From C = W·log₂(1+SNR): SNR = 2^C/W − 1.

SNR (linear) for Option A

Show derivation

Option A: SNR = 2^10/5 − 1 = 2² − 1 = 3

Option B: SNR = 2^10/2 − 1 = 2⁵ − 1 = 31

Halving the bandwidth requires ~10x more power (SNR). This exponential tradeoff is why bandwidth is the more valuable resource — spectrum is scarce, transmit power is adjustable.

Chapter 6: Rate-Distortion

Source coding (Ch 4) was about lossless compression — no information lost. But what if you're willing to tolerate some distortion? JPEG, MP3, video codecs, and neural network quantization all exploit this. Rate-distortion theory tells us the minimum bit rate R needed to represent a source with average distortion at most D.

R(D) = min_p(x̂|x) I(X; X̂) subject to E[d(X, X̂)] ≤ D

For a Gaussian source X ~ N(0, σ²) with mean squared error distortion:

R(D) = 0.5 · log₂(σ² / D) for D ≤ σ²

The lossy compression tradeoff. R(D) is a decreasing convex function. Allowing a little distortion saves a lot of bits. Allowing a lot of distortion saves diminishing additional bits. This curve is the fundamental limit every lossy codec operates against.

Exercise 6.1: Gaussian Rate-Distortion Derive

A Gaussian source has variance σ² = 1. Compute R(D) for D = 0.1 (low distortion).

bits/sample

Show derivation

R(0.1) = 0.5 · log₂(1 / 0.1) = 0.5 · log₂(10)

= 0.5 × 3.322 = 1.661 bits/sample

To represent a Gaussian source with 10% of its variance as distortion, you need at least 1.661 bits per sample. This is the absolute floor — no quantizer can do better.

Exercise 6.2: Halving the Distortion Derive

Same source (σ² = 1). How many additional bits per sample do you need to halve the distortion from D = 0.1 to D = 0.05?

additional bits

Show derivation

R(0.05) = 0.5 · log₂(1/0.05) = 0.5 · log₂(20) = 0.5 × 4.322 = 2.161

Additional = R(0.05) − R(0.1) = 2.161 − 1.661 = 0.500 bits

Halving the distortion costs exactly 0.5 bits per sample for a Gaussian source. This is a universal rule for Gaussian R-D: each halving of distortion adds 0.5 bits. Equivalently, each additional bit reduces distortion by factor of 4 (6 dB per bit).

Exercise 6.3: Quantization Bits Derive

You're quantizing neural network weights from FP32 to a fixed number of bits. If weights are approximately Gaussian with σ² = 0.01, and you can tolerate MSE distortion D = 0.0001 (1% of variance), how many bits per weight minimum?

bits/weight

Show derivation

R(D) = 0.5 · log₂(0.01 / 0.0001) = 0.5 · log₂(100)

= 0.5 × 6.644 = 3.322 bits/weight

This says ~3.3 bits per weight is the theoretical minimum for 1% distortion. In practice, 4-bit quantization (INT4) works well for many models — it's slightly above the theoretical limit. Going to 2-bit quantization would give D = σ²/2⁴ = 0.000625, which is 6.25% of variance — often too much.

Exercise 6.4: R-D Tradeoff Trace

For a Gaussian source, if you use R = 0 bits (zero rate), what is the minimum achievable distortion D?

D = 0 — you can always achieve zero distortion D = σ² — best you can do is predict the mean, getting MSE = variance D = ∞ — without any bits, distortion is unbounded D = σ — distortion equals standard deviation

Show reasoning

R(D) = 0.5·log₂(σ²/D). Setting R = 0: 0 = 0.5·log₂(σ²/D), so σ²/D = 1, meaning D = σ². With zero bits, the optimal strategy is to always output the mean (0), giving MSE = E[X²] = σ². You can't encode anything, so you just predict the most likely value.

Exercise 6.5: 6 dB per Bit Rule Derive

Starting from R(D) = 0.5·log₂(σ²/D), show that each additional bit of rate reduces distortion by a factor of 4. If D₁ at R bits, what is D₂ at R+1 bits? Express D₂/D₁ as a fraction.

ratio D⊂2/D⊂1

Show derivation

R = 0.5·log₂(σ²/D) ⇒ D = σ² · 2^−2R

D₁ = σ² · 2^−2R, D₂ = σ² · 2^−2(R+1) = σ² · 2^−2R−2

D₂/D₁ = 2⁻² = 1/4 = 0.25

Each additional bit buys a 4x reduction in distortion. In dB: 10·log₁₀(4) = 6.02 dB. This "6 dB per bit" rule is a cornerstone of audio/video engineering and quantization theory.

Chapter 8: Bits-Back Coding & VAE

The variational autoencoder (VAE) has a deep connection to information theory. Its training objective, the ELBO (Evidence Lower Bound), is literally a coding cost:

ELBO = E_q(z|x)[log p(x|z)] − D_KL(q(z|x) || p(z))

The first term is the reconstruction quality (how well the decoder recovers x from z). The second term is the information cost — how many extra bits the encoder uses beyond the prior. The ELBO is a lower bound on log p(x), the log-likelihood of the data.

Bits-back argument. The KL term has a coding interpretation: it's the number of bits you "waste" by encoding z using the posterior q(z|x) instead of the prior p(z). The clever insight of bits-back coding is that you can recover these bits if you have random bits to "spend" on encoding — hence the name "bits back." This makes the KL term not a pure cost but a net cost.

For Gaussian encoder q(z|x) = N(μ, σ²) and standard normal prior p(z) = N(0,1), the KL has a closed form:

D_KL(N(μ₁, σ₁²) || N(μ₂, σ₂²)) = log(σ₂/σ₁) + (σ₁² + (μ₁−μ₂)²) / (2σ₂²) − 0.5

Exercise 8.1: KL to Standard Normal Derive

The VAE encoder outputs μ = 0.5, σ = 0.8 for one latent dimension. The prior is N(0,1). Compute the KL divergence in nats.

For q = N(μ, σ²) vs p = N(0,1): KL = −0.5·(1 + log(σ²) − μ² − σ²). Use ln(0.64) ≈ −0.446.

nats

Show derivation

KL = −0.5 × (1 + ln(0.64) − 0.25 − 0.64)

= −0.5 × (1 − 0.446 − 0.25 − 0.64)

= −0.5 × (−0.336) = 0.168

The encoder is moderately close to the prior — μ is only 0.5 away from 0, and σ = 0.8 is fairly close to 1. The KL cost per dimension is small. In a VAE with hundreds of latent dimensions, these small per-dimension costs add up.

Exercise 8.2: KL Between Two Gaussians Derive

Compute D_KL(N(1, 0.5²) || N(0, 1²)). Use the full formula: KL = ln(σ₂/σ₁) + (σ₁² + (μ₁−μ₂)²)/(2σ₂²) − 0.5.

nats

Show derivation

μ₁=1, σ₁=0.5, μ₂=0, σ₂=1

KL = ln(1/0.5) + (0.25 + 1)/(2×1) − 0.5

= ln(2) + 1.25/2 − 0.5

= 0.693 + 0.625 − 0.5 = 0.818 nats

Exercise 8.3: ELBO Decomposition Trace

A VAE has reconstruction loss = 50.0 nats and KL loss = 3.2 nats. What is the ELBO, and what does maximizing it mean?

ELBO = 53.2; maximize by increasing both terms ELBO = −53.2; maximize by making reconstruction good (less negative) and KL small ELBO = 46.8; maximize by making reconstruction dominate KL ELBO = −46.8; maximize by reducing reconstruction loss

Show reasoning

ELBO = E[log p(x|z)] − KL. The reconstruction term is a negative log-likelihood, typically reported as a positive loss. So ELBO = −50.0 − 3.2 = −53.2. Maximizing the ELBO (making it less negative) means: (1) improve reconstruction (decrease the 50.0), and (2) keep the posterior close to the prior (decrease the 3.2). In practice, we minimize −ELBO = 53.2.

Exercise 8.4: KL Annealing Trace

In "posterior collapse," the VAE ignores z entirely (q(z|x) ≈ p(z) for all x). What are the KL and reconstruction losses in this regime?

KL ≈ 0 (posterior matches prior), reconstruction is poor (decoder ignores z) KL ≈ 0, reconstruction is also good KL is large, reconstruction is poor KL is large, reconstruction is good

Show reasoning

Posterior collapse: q(z|x) = p(z) regardless of x, so KL = 0 — the "information cost" is zero because the encoder transmits zero information. But this means the decoder gets no useful signal from z and must model p(x) entirely on its own, giving poor reconstruction. The fix: KL annealing — start with low KL weight and gradually increase it, letting the encoder learn useful representations before being penalized for diverging from the prior.

Exercise 8.5: Multi-Dimensional KL Derive

A VAE has latent dimension d = 3. The encoder outputs μ = (0.5, −0.3, 0.8) and σ = (0.7, 1.2, 0.4). The prior is N(0, I). Compute the total KL (sum over dimensions).

For each dimension j: KL_j = −0.5(1 + ln(σ_j²) − μ_j² − σ_j²). Sum them.

nats

Show derivation

Dim 0: −0.5(1 + ln(0.49) − 0.25 − 0.49) = −0.5(1 − 0.713 − 0.25 − 0.49) = −0.5(−0.453) = 0.227

Dim 1: −0.5(1 + ln(1.44) − 0.09 − 1.44) = −0.5(1 + 0.365 − 0.09 − 1.44) = −0.5(−0.165) = 0.083

Dim 2: −0.5(1 + ln(0.16) − 0.64 − 0.16) = −0.5(1 − 1.833 − 0.64 − 0.16) = −0.5(−1.633) = 0.817

Total = 0.227 + 0.083 + 0.816 = 1.126 nats. Dimension 2 dominates: it has both a shifted mean (μ=0.8) and a narrow variance (σ=0.4), making it most different from the N(0,1) prior. Dimension 1 contributes least because its wider variance (σ=1.2) partially offsets the small mean shift.

Exercise 8.6: Find the Bug Debug

This VAE KL computation produces negative values. Click the buggy line.

function vaeKL(mu, logvar) {
  // mu, logvar: arrays of length d
  let kl = 0;
  for (let i = 0; i < mu.length; i++) {
    const sigma2 = Math.exp(logvar[i]);
    kl += 0.5 * (1 + logvar[i] - mu[i]*mu[i] - sigma2);
  }
  return kl;
}

Show explanation

Line 6 computes the negative ELBO contribution (which is negative, not the KL itself). The KL divergence is KL = −0.5·(1 + log(σ²) − μ² − σ²). The line is missing the negative sign. It should be: kl += -0.5 * (1 + logvar[i] - mu[i]*mu[i] - sigma2);. Without the negation, the sum is negative (since σ² + μ² > 1 + log(σ²) for typical values), making KL appear negative, which violates KL ≥ 0.

Chapter 9: Capstone: Compression Analysis

You've built the full toolkit. Now let's analyze a real-world compression scenario end-to-end: from character frequencies to Huffman bounds to LLM-as-compressor. This is the kind of back-of-envelope analysis that separates someone who understands information theory from someone who just memorized formulas.

The capstone challenge. A text file contains 10,000 characters with the following frequency counts: E:1270, T:906, A:817, O:751, I:697, N:675, S:633, H:609, R:599, space:2043, other:1000 (all remaining characters lumped together). You'll compute entropy, Huffman bounds, and connect it all to LLM compression.

Exercise 9.1: Character Entropy Derive

Convert the frequency counts to probabilities (divide by 10,000) and compute H(X). The probabilities are: space=0.2043, E=0.127, T=0.0906, A=0.0817, O=0.0751, I=0.0697, N=0.0675, S=0.0633, H=0.0609, R=0.0599, other=0.1.

Use the formula H = −∑ p_i·log₂(p_i). This has 11 terms.

bits/char

Show derivation

H = −[0.2043·log₂(0.2043) + 0.127·log₂(0.127) + 0.0906·log₂(0.0906)

+ 0.0817·log₂(0.0817) + 0.0751·log₂(0.0751) + 0.0697·log₂(0.0697)

+ 0.0675·log₂(0.0675) + 0.0633·log₂(0.0633) + 0.0609·log₂(0.0609)

+ 0.0599·log₂(0.0599) + 0.1·log₂(0.1)]

≈ 0.484 + 0.383 + 0.316 + 0.296 + 0.280 + 0.267 + 0.261 + 0.250 + 0.244 + 0.241 + 0.332

= 3.354 bits/char (approximately)

This is the zero-order (unigram) entropy — treating each character as independent. Real English text has much lower entropy (~1.0–1.5 bits/char) because of sequential dependencies. The gap between unigram entropy and true entropy is what n-gram and neural language models exploit.

Exercise 9.2: Theoretical Minimum File Size Derive

Given the unigram entropy from 9.1 (~3.3 bits/char), what is the theoretical minimum file size for the 10,000-character file? Express in bytes (8 bits = 1 byte). Compare to the naive ASCII size (8 bits/char).

bytes (minimum)

Show derivation

Minimum bits = 10,000 × 3.3 = 33,000 bits

Minimum bytes = 33,000 / 8 = 4,125 bytes

ASCII size = 10,000 × 8 / 8 = 10,000 bytes

Compression ratio = 10,000 / 4,125 = 2.42:1

We can compress to ~41% of the original size using just unigram statistics. Real compressors (gzip, zstd) do even better because they exploit sequential patterns. The Shannon limit with full context is ~1.2 bits/char × 10,000 = 1,500 bytes — a 6.67:1 ratio.

Exercise 9.3: Huffman vs Arithmetic Trace

Huffman coding is limited to integer bit lengths per symbol. Arithmetic coding can achieve fractional bits per symbol. For our 11-symbol alphabet with H ≈ 3.3 bits/symbol, which statement is correct?

Huffman and arithmetic achieve identical compression Huffman is guaranteed to be within 0.01 bits of entropy Huffman is within 1 bit of entropy per symbol; arithmetic coding approaches entropy exactly Arithmetic coding is always worse due to rounding errors

Show reasoning

Huffman: H ≤ L < H + 1 (per symbol). Arithmetic coding can approach H to within O(1/n) total bits for a sequence of length n. For long sequences, arithmetic coding is near-optimal. This is why modern compressors (zlib, zstd, ANS) use arithmetic-coding variants rather than Huffman.

Exercise 9.4: LLM as Compressor Derive

A language model achieves perplexity 15.3 on a test set. Convert this to bits per character (assuming character-level tokenization). Perplexity = 2^H where H is the cross-entropy in bits.

bits/char

Show derivation

PPL = 2^H ⇒ H = log₂(PPL) = log₂(15.3)

= 3.935 bits/char

This model is worse than unigram Huffman (3.3 bits/char)! It must be a bad model or the test set is unusual. GPT-4-class models achieve ~0.7–0.9 bits per character on standard English, approaching the Shannon limit. A perplexity of 15.3 at the character level is quite high.

Exercise 9.5: Subword Perplexity to BPC Derive

Most LLMs use subword tokenization. GPT-2 achieves perplexity 29.4 on WikiText-103 with an average of 1.3 tokens per word and 4.5 characters per word. Convert to bits per character.

bits/token = log₂(PPL). tokens/char = 1.3/4.5. bits/char = bits/token × tokens/char.

bits/char

Show derivation

bits/token = log₂(29.4) = 4.877

tokens/char = 1.3 / 4.5 = 0.2889

bits/char = 4.877 × 0.2889 = 1.409

~1.4 bits per character is in the range of Shannon's estimate for English (1.0–1.5 bits/char). This means GPT-2 captures most of the statistical structure of English text. Better models (GPT-4) push this even lower, suggesting the true entropy of English may be below 1.0 bits/char.

Exercise 9.6: Compression Comparison Derive

For our 10,000-character file, compute the compressed size (in bytes) for each method: (a) Unigram Huffman at 3.3 bits/char, (b) gzip-like at ~2.0 bits/char, (c) LLM-based at 1.4 bits/char. Express all as bytes.

bytes (LLM method)

Show derivation

(a) Huffman: 10,000 × 3.3 / 8 = 4,125 bytes

(b) gzip: 10,000 × 2.0 / 8 = 2,500 bytes

From 10 KB original to 1.75 KB — a 5.7:1 compression ratio. The progression shows how each level of modeling captures more structure: unigram frequencies → local patterns (gzip) → full linguistic context (LLM). This is why "language modeling is compression" is not just a metaphor — a better language model is literally a better compressor.

Exercise 9.7: Arrange the Compression Stack Design

Order these compression approaches from worst to best compression ratio on English text.

→

ASCII (8 bits/char) Huffman (~3.3 bits) gzip/LZ77 (~2.0 bits) LLM-based (~1.0 bits) Shannon limit (~0.8 bits)

Show correct order

Worst → Best compression: ASCII (no compression) → Huffman (unigram model) → gzip (local pattern matching) → LLM (full context model) → Shannon limit (theoretical floor). Each step exploits more structure in the data. The Shannon limit is unreachable but approachable.

Exercise 9.8: Implement bitsPerChar() Build

Write a function that takes a perplexity value and characters-per-token ratio, and returns bits per character.

bits/token = log2(PPL). bits/char = bits/token / chars/token.

Show solution

javascript
function bitsPerChar(perplexity, charsPerToken) {
  const bitsPerToken = Math.log2(perplexity);
  return bitsPerToken / charsPerToken;
}

Information Theory

Chapter 0: Entropy

Chapter 1: Joint & Conditional Entropy

Chapter 2: KL Divergence

Chapter 3: Cross-Entropy

Chapter 4: Source Coding

Chapter 5: Channel Capacity

Chapter 6: Rate-Distortion

Chapter 7: Mutual Information in ML

Chapter 8: Bits-Back Coding & VAE

Chapter 9: Capstone: Compression Analysis