Two measures of information, from different worlds, meeting on common ground. Expected complexity equals entropy.
Shannon entropy and Kolmogorov complexity both measure "information," but they come from completely different traditions:
| Shannon Entropy H(X) | Kolmogorov Complexity C(x) | |
|---|---|---|
| Measures | Average info per message from a source | Info in a specific individual string |
| Inputs | A probability distribution | A single string |
| Framework | Probability theory | Computability theory |
| Computable? | Yes (if distribution is known) | No |
| Year | 1948 (Shannon) | 1965 (Kolmogorov) |
Despite these differences, they are deeply connected. This chapter builds the bridge: the expected Kolmogorov complexity of a random variable equals its Shannon entropy (up to O(1) for prefix complexity, up to O(log n) for plain).
Let X be a random variable taking values x1, ..., xk with probabilities p1, ..., pk. The Shannon entropy is:
(Convention: 0 log 0 = 0.) The entropy is measured in bits (using log base 2).
What it measures: H(X) is the minimum average number of bits needed to encode messages from X. If you design an optimal code, the average codeword length equals H(X) (Shannon's source coding theorem).
H(X) = 0 iff X is deterministic. H(X) = log k iff X is uniformly distributed over k values. For everything in between, 0 < H(X) < log k.
Shannon entropy satisfies beautiful algebraic properties:
The conditional entropy H(Y|X) measures the remaining uncertainty about Y after learning X:
It is the average entropy of Y over the conditional distributions. Not the entropy of Y given a specific x, but the expected value of that quantity.
The parallels between Shannon and Kolmogorov are now fully visible:
| Shannon | Kolmogorov (prefix) | Error |
|---|---|---|
| H(X,Y) = H(X) + H(Y|X) | K(x,y) = K(x) + K(y|x*) + O(1) | O(1) |
| I(X;Y) = I(Y;X) | I(x:y) = I(y:x) + O(log n) | O(log n) |
| H(X|Y) ≤ H(X) | K(x|y) ≤ K(x) + O(1) | O(1) |
| H(X|Y) ≥ 0 | K(x|y) ≥ 0 | exact |
Here is the central theorem connecting the two worlds:
The expected prefix complexity of X equals the Shannon entropy of X, up to O(1). Shannon entropy is the average of individual Kolmogorov complexities.
For plain complexity, the bound is weaker: E[C(X)] = H(X) + O(log n), where n bounds the length of the values.
E[K(X)] ≤ H(X) + O(1).
Proof sketch: construct a prefix-free code with codeword lengths ⌈−log p(x)⌉ for each value x. By the Kraft inequality, such a code exists because ∑ 2−⌈−log p(x)⌉ ≤ ∑ p(x) = 1. The expected codeword length is ∑ p(x) ⌈−log p(x)⌉ ≤ H(X) + 1.
Since K(x) is at most the length of this codeword (the optimal prefix machine is at least as good), we get E[K(X)] ≤ H(X) + O(1).
E[K(X)] ≥ H(X) − O(1).
This is the deeper direction. We use Kraft's inequality for the optimal prefix machine.
Since ∑x 2−K(x) ≤ 1, the function q(x) = 2−K(x) is a semimeasure. By the log-sum inequality (or Jensen's inequality):
By Gibbs' inequality, for any distribution p and any semimeasure q with ∑ q(x) ≤ 1:
This gives E[K(X)] ≥ H(X). (The Gibbs inequality says: no code beats the Shannon limit on average.) The O(1) comes from the constant relating the optimal prefix machine to the specific code used for p.
The bridge theorem has far-reaching consequences:
3. Typicality: For a random variable X with distribution p over n-bit strings, most outcomes x have K(x) ≈ H(X). The "typical set" in information theory corresponds to the set of strings with complexity near H(X). This is the complexity-theoretic version of Shannon's asymptotic equipartition property.
Explore how Shannon entropy varies with the distribution. The slider controls the probability of outcome A in a binary distribution (A vs B).
Drag the slider to change the probability p. Entropy is maximized at p = 0.5 (maximum uncertainty) and minimized at p = 0 or 1 (complete certainty).
| Result | Statement |
|---|---|
| Shannon entropy | H(X) = −∑ p(x) log p(x) |
| Chain rule | H(X,Y) = H(X) + H(Y|X) — exact |
| Bridge theorem | E[K(X)] = H(X) + O(1) |
| Upper bound | Via Shannon coding: codeword length ⌈−log p(x)⌉ |
| Lower bound | Via Gibbs/KL divergence non-negativity |
| Transfer principle | Shannon inequalities ⇒ Kolmogorov inequalities |