Shen, Uspensky & Vereshchagin — Chapter 7

Shannon Entropy &
Kolmogorov Complexity

Two measures of information, from different worlds, meeting on common ground. Expected complexity equals entropy.

Prerequisites: Chapters 1–4 (plain and prefix complexity). Basic probability (random variables, expectation).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Two Worlds

Shannon entropy and Kolmogorov complexity both measure "information," but they come from completely different traditions:

Shannon Entropy H(X)Kolmogorov Complexity C(x)
MeasuresAverage info per message from a sourceInfo in a specific individual string
InputsA probability distributionA single string
FrameworkProbability theoryComputability theory
Computable?Yes (if distribution is known)No
Year1948 (Shannon)1965 (Kolmogorov)

Despite these differences, they are deeply connected. This chapter builds the bridge: the expected Kolmogorov complexity of a random variable equals its Shannon entropy (up to O(1) for prefix complexity, up to O(log n) for plain).

The bridge theorem (preview): If X is a random variable taking values in a finite set with distribution p, then E[K(X)] = H(X) + O(1). Shannon entropy is the average case of Kolmogorov complexity.
Check: Shannon entropy measures information in...

Chapter 1: Shannon Entropy

Let X be a random variable taking values x1, ..., xk with probabilities p1, ..., pk. The Shannon entropy is:

H(X) = −∑i pi log pi

(Convention: 0 log 0 = 0.) The entropy is measured in bits (using log base 2).

What it measures: H(X) is the minimum average number of bits needed to encode messages from X. If you design an optimal code, the average codeword length equals H(X) (Shannon's source coding theorem).

Examples:
• Fair coin: H = −(1/2)log(1/2) − (1/2)log(1/2) = 1 bit. Each flip gives 1 bit of information.
• Biased coin (p=0.9): H = −0.9 log 0.9 − 0.1 log 0.1 ≈ 0.47 bits. Less surprise, less info.
• Deterministic (p=1): H = 0 bits. No surprise at all.
• Uniform over 256 symbols: H = log 256 = 8 bits. Maximum entropy.

H(X) = 0 iff X is deterministic. H(X) = log k iff X is uniformly distributed over k values. For everything in between, 0 < H(X) < log k.

Check: The entropy of a fair 6-sided die is...

Chapter 2: Properties of H

Shannon entropy satisfies beautiful algebraic properties:

Comparison with Kolmogorov: Every Shannon property has a Kolmogorov analogue, but with additive error terms. Shannon's chain rule is exact; Kolmogorov's has O(log n) or O(1) error. Shannon's "conditioning reduces entropy" holds exactly; for Kolmogorov, C(x|y) ≤ C(x) + O(1) (not exactly ≤). The error terms are the "noise" introduced by working with individual objects instead of distributions.
Check: The chain rule for Shannon entropy H(X,Y) = H(X) + H(Y|X) holds...

Chapter 3: Pairs and Conditional Entropy

The conditional entropy H(Y|X) measures the remaining uncertainty about Y after learning X:

H(Y|X) = ∑x P(X=x) · H(Y|X=x)

It is the average entropy of Y over the conditional distributions. Not the entropy of Y given a specific x, but the expected value of that quantity.

Mutual information: I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X,Y). This is symmetric: I(X;Y) = I(Y;X), exactly. Compare with Kolmogorov's I(x:y) = I(y:x) + O(log n) — the symmetry is approximate.

The parallels between Shannon and Kolmogorov are now fully visible:

ShannonKolmogorov (prefix)Error
H(X,Y) = H(X) + H(Y|X)K(x,y) = K(x) + K(y|x*) + O(1)O(1)
I(X;Y) = I(Y;X)I(x:y) = I(y:x) + O(log n)O(log n)
H(X|Y) ≤ H(X)K(x|y) ≤ K(x) + O(1)O(1)
H(X|Y) ≥ 0K(x|y) ≥ 0exact
Check: Shannon's mutual information I(X;Y) is symmetric...

Chapter 4: The Bridge

Here is the central theorem connecting the two worlds:

Theorem: Let X be a random variable taking values in a finite set Χ, with computable distribution p. Then:
E[K(X)] = H(X) + O(1)

The expected prefix complexity of X equals the Shannon entropy of X, up to O(1). Shannon entropy is the average of individual Kolmogorov complexities.

For plain complexity, the bound is weaker: E[C(X)] = H(X) + O(log n), where n bounds the length of the values.

Intuition: Shannon's source coding theorem says you need at least H(X) bits on average to encode messages. Prefix complexity gives the optimal encoding for each individual message. The optimal individual encoding averages out to the optimal average encoding. This is not a coincidence — it's a deep structural fact.
Check: For a random variable X with computable distribution, E[K(X)] equals...

Chapter 5: The Upper Bound

E[K(X)] ≤ H(X) + O(1).

Proof sketch: construct a prefix-free code with codeword lengths ⌈−log p(x)⌉ for each value x. By the Kraft inequality, such a code exists because ∑ 2−⌈−log p(x)⌉ ≤ ∑ p(x) = 1. The expected codeword length is ∑ p(x) ⌈−log p(x)⌉ ≤ H(X) + 1.

Since K(x) is at most the length of this codeword (the optimal prefix machine is at least as good), we get E[K(X)] ≤ H(X) + O(1).

Note: This is essentially Shannon's source coding theorem (achievability direction) — there exists a prefix-free code achieving average length H(X) + 1. Kolmogorov complexity inherits this bound because the optimal prefix machine can simulate any prefix code.
Check: The upper bound proof constructs a code with codeword length...

Chapter 6: The Lower Bound

E[K(X)] ≥ H(X) − O(1).

This is the deeper direction. We use Kraft's inequality for the optimal prefix machine.

Since ∑x 2−K(x) ≤ 1, the function q(x) = 2−K(x) is a semimeasure. By the log-sum inequality (or Jensen's inequality):

E[K(X)] = E[−log 2−K(X)] = −E[log q(X)]

By Gibbs' inequality, for any distribution p and any semimeasure q with ∑ q(x) ≤ 1:

−∑ p(x) log q(x) ≥ −∑ p(x) log p(x) = H(X)

This gives E[K(X)] ≥ H(X). (The Gibbs inequality says: no code beats the Shannon limit on average.) The O(1) comes from the constant relating the optimal prefix machine to the specific code used for p.

Gibbs' inequality: For any probability distribution p and any non-negative function q with ∑ q(x) ≤ 1, we have ∑ p(x) log(p(x)/q(x)) ≥ 0. This is the non-negativity of KL divergence, and it gives the lower bound on expected code length.
Check: The lower bound uses which key inequality?

Chapter 7: Implications

The bridge theorem has far-reaching consequences:

1. Shannon inequalities hold for Kolmogorov complexity. Any linear inequality satisfied by Shannon entropy for all distributions is also satisfied by Kolmogorov complexity (up to additive terms). For example, the subadditivity C(x,y) ≤ C(x) + C(y) + O(log n) follows from H(X,Y) ≤ H(X) + H(Y).
2. Non-Shannon inequalities exist for Kolmogorov complexity too. Zhang and Yeung (1998) found information inequalities not implied by Shannon's axioms. These carry over to Kolmogorov complexity. This shows that the complexity world is at least as rich as the entropy world.

3. Typicality: For a random variable X with distribution p over n-bit strings, most outcomes x have K(x) ≈ H(X). The "typical set" in information theory corresponds to the set of strings with complexity near H(X). This is the complexity-theoretic version of Shannon's asymptotic equipartition property.

Check: Every Shannon entropy inequality holds for Kolmogorov complexity because...

Chapter 8: Entropy Visualizer

Explore how Shannon entropy varies with the distribution. The slider controls the probability of outcome A in a binary distribution (A vs B).

Binary Entropy H(p, 1−p)

Drag the slider to change the probability p. Entropy is maximized at p = 0.5 (maximum uncertainty) and minimized at p = 0 or 1 (complete certainty).

p0.50
Check: At what value of p does the binary entropy H(p, 1−p) reach its maximum?

Chapter 9: Summary

ResultStatement
Shannon entropyH(X) = −∑ p(x) log p(x)
Chain ruleH(X,Y) = H(X) + H(Y|X) — exact
Bridge theoremE[K(X)] = H(X) + O(1)
Upper boundVia Shannon coding: codeword length ⌈−log p(x)⌉
Lower boundVia Gibbs/KL divergence non-negativity
Transfer principleShannon inequalities ⇒ Kolmogorov inequalities
Looking ahead: Chapter 8 applies Kolmogorov complexity to prove results in combinatorics, automata theory, and number theory. The incompressibility method — which uses the existence of incompressible strings as a proof technique — yields surprisingly clean proofs of classical theorems.
Final check: The bridge theorem E[K(X)] = H(X) + O(1) requires which version of complexity?