How much information does a pair contain? How much does knowing y help describe x? The Kolmogorov-Levin chain rule ties it together.
You have two strings, x and y. How much information does the pair (x, y) contain in total? And how much does knowing x help you describe y?
Consider two scenarios:
Kolmogorov complexity handles this elegantly. The complexity of the pair, C(x, y), captures the total information. The conditional complexity C(y|x) — the shortest description of y when x is given for free — captures how much new information y adds beyond x.
To define C(x, y), we first need a computable way to encode a pair of strings as a single string. We need a computable bijection [·, ·]: Σ* × Σ* → Σ*.
One natural approach: encode the length of x, then x, then y. But how do you encode the length? You need a self-delimiting encoding — one where the decoder knows where the length ends and the data begins.
This encoding is inefficient for long x. A better approach: use x̅̅ where you self-delimit the length of x instead of x itself.
Here l(x) is the binary representation of |x|, and l(x)̅ is its self-delimiting version. Since |l(x)| = log |x| + O(1), the overhead is 2 log |x| + O(1). So |[x, y]| = |x| + |y| + 2 log |x| + O(1).
We can push the overhead even further down using iterated length encoding: encode the length of the length of x, etc. This gives |[x, y]| = |x| + |y| + log |x| + 2 log log |x| + O(1). But the 2 log |x| version is standard and sufficient for most purposes.
The Kolmogorov complexity of a pair is defined as C(x, y) = C([x, y]) — the complexity of the encoding of the pair. This is well-defined up to O(1) regardless of which specific encoding we use, since all computable bijections differ by O(1).
Proof: Let p be a shortest description of x and q a shortest description of y. Then p̅q (self-delimiting p followed by q) is a description of the pair. Its length is |p| + |q| + 2 log |p| + O(1) = C(x) + C(y) + 2 log C(x) + O(1).
The logarithmic overhead is the cost of "separating" the two descriptions. It's the price you pay for not knowing in advance how long x's description is.
Note: C(x, y) can be much less than C(x) + C(y) when x and y share information. The extreme case: C(x, x) = C(x) + O(log C(x)), not 2C(x). The pair (x, x) contains no more information than x alone.
The conditional Kolmogorov complexity C(x|y) is the length of the shortest program that, given y as auxiliary input, outputs x. Formally:
Here D is now a conditional decompressor: a partial computable function of two arguments. The optimal conditional decompressor exists by the same universality argument.
Intuitively, C(x|y) measures how much additional information x contains beyond what's in y. If y = x, then C(x|y) = O(1) — given yourself, you need zero information. If y is independent of x, then C(x|y) ≈ C(x) — knowing y doesn't help.
All the basic properties carry over: C(x|y) is not computable, is upper semicomputable, satisfies the counting argument (for each y and n, fewer than 2n strings x have C(x|y) < n), and is robust under computable transformations.
We can now define the algorithmic mutual information between x and y:
The second equality is the symmetry of information, a consequence of the Kolmogorov-Levin theorem. It says that x provides as much information about y as y provides about x (up to logarithmic terms).
This parallels Shannon's mutual information, but for individual strings rather than random variables. The analogy goes deep:
| Shannon | Kolmogorov |
|---|---|
| H(X) | C(x) |
| H(X|Y) | C(x|y) |
| H(X, Y) | C(x, y) |
| I(X; Y) = H(X) − H(X|Y) | I(x : y) = C(x) − C(x|y) |
| H(X, Y) = H(X) + H(Y|X) | C(x, y) = C(x) + C(y|x) + O(log n) |
The chain rule for Shannon entropy holds exactly: H(X, Y) = H(X) + H(Y|X). For Kolmogorov complexity, it holds up to O(log n). The logarithmic error is the cost of encoding the "separator" between the two parts of the description.
The Kolmogorov-Levin theorem is the crown jewel of this chapter. Let's state it precisely.
This says the shortest description of the pair (x, y) consists of three parts:
The upper bound (C(x, y) ≤ C(x) + C(y|x) + O(log n)) is straightforward: describe x, self-delimit its description, then describe y given x. The self-delimiting adds O(log n) bits.
The lower bound (C(x, y) ≥ C(x) + C(y|x) − O(log n)) is the hard part. It says you cannot do significantly better than this decomposition. We'll sketch the proof next.
The proof of the lower bound is the first genuinely clever argument in the theory. Here's the idea.
Let a = C(x, y). Consider the set A of all pairs (u, v) with C(u, v) ≤ a. By the counting argument, |A| ≤ 2a+1. Our pair (x, y) is one element of this set.
For each string t, define the section At = {v : (t, v) ∈ A}. Think of A as a 2D region and At as a vertical slice at position t.
Adding the two bounds: C(x) + C(y|x) ≤ (a − m) + m + O(log n) = a + O(log n) = C(x, y) + O(log n). Done.
The key trick is the section argument: large sections mean y is easy to describe given x (few bits needed for the index), but x must be in a "popular" first coordinate, and there are few of those. Small sections mean x could be anywhere, but y is hard to specify. The total always adds up to a = C(x, y).
The chain rule gives us two decompositions:
Subtracting, we get the symmetry of algorithmic information:
Or equivalently: I(x : y) = I(y : x) + O(log n). The amount of information x has about y equals the amount y has about x, up to O(log n).
The O(log n) error term is unavoidable. One can construct pairs where C(x, y) − C(x) − C(y|x) is as large as log n, and pairs where it is as negative as −log n. The logarithmic gap is the price of not knowing the separator position in advance.
The simulation below illustrates the chain rule decomposition. For a pair (x, y), it shows how the total information C(x, y) splits into C(x) (information about x alone) and C(y|x) (the additional information in y beyond x).
Choose a relationship between x and y. The diagram shows how C(x,y) decomposes into C(x) + C(y|x) + O(log n). The overlap region represents mutual information.
| Result | Statement |
|---|---|
| Pair encoding | [x, y] encodes a pair; overhead is 2 log|x| + O(1) |
| Pair upper bound | C(x,y) ≤ C(x) + C(y) + 2 log C(x) + O(1) |
| Conditional complexity | C(x|y) = min{|p| : D(p,y) = x} |
| Chain rule | C(x,y) = C(x) + C(y|x) + O(log n) |
| Symmetry | C(x) − C(x|y) = C(y) − C(y|x) + O(log n) |
| Mutual information | I(x:y) = C(x) − C(x|y), and I(x:y) = I(y:x) + O(log n) |