Shen, Uspensky & Vereshchagin — Chapter 2

Pairs & Conditional Complexity

How much information does a pair contain? How much does knowing y help describe x? The Kolmogorov-Levin chain rule ties it together.

Prerequisites: Chapter 1 (plain complexity, counting argument, upper semicomputability).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: The Problem

You have two strings, x and y. How much information does the pair (x, y) contain in total? And how much does knowing x help you describe y?

Consider two scenarios:

Kolmogorov complexity handles this elegantly. The complexity of the pair, C(x, y), captures the total information. The conditional complexity C(y|x) — the shortest description of y when x is given for free — captures how much new information y adds beyond x.

The chain rule (preview): C(x, y) = C(x) + C(y|x) + O(log n). The information in a pair equals the information in x plus the information in y that isn't already in x. This is the Kolmogorov-Levin theorem, the first deep result in the theory.
Check: If x and y are independent random strings of length n, roughly how large is C(x, y)?

Chapter 1: Encoding Pairs

To define C(x, y), we first need a computable way to encode a pair of strings as a single string. We need a computable bijection [·, ·]: Σ* × Σ* → Σ*.

One natural approach: encode the length of x, then x, then y. But how do you encode the length? You need a self-delimiting encoding — one where the decoder knows where the length ends and the data begins.

Standard encoding: Write x̅ = the bits of x with each bit doubled (0 → 00, 1 → 11), followed by 01 as a terminator. Then [x, y] = x̅y. The overhead is |x̅| − |x| = |x| + 2 bits, so |[x, y]| = 2|x| + 2 + |y|.

This encoding is inefficient for long x. A better approach: use x̅̅ where you self-delimit the length of x instead of x itself.

[x, y] = l(x)̅ · x · y

Here l(x) is the binary representation of |x|, and l(x)̅ is its self-delimiting version. Since |l(x)| = log |x| + O(1), the overhead is 2 log |x| + O(1). So |[x, y]| = |x| + |y| + 2 log |x| + O(1).

We can push the overhead even further down using iterated length encoding: encode the length of the length of x, etc. This gives |[x, y]| = |x| + |y| + log |x| + 2 log log |x| + O(1). But the 2 log |x| version is standard and sufficient for most purposes.

Check: Why can't we just concatenate x and y to encode the pair?

Chapter 2: Pair Complexity

The Kolmogorov complexity of a pair is defined as C(x, y) = C([x, y]) — the complexity of the encoding of the pair. This is well-defined up to O(1) regardless of which specific encoding we use, since all computable bijections differ by O(1).

Theorem 16 (Upper bound on pairs): C(x, y) ≤ C(x) + C(y) + 2 log C(x) + O(1).

Proof: Let p be a shortest description of x and q a shortest description of y. Then p̅q (self-delimiting p followed by q) is a description of the pair. Its length is |p| + |q| + 2 log |p| + O(1) = C(x) + C(y) + 2 log C(x) + O(1).

The logarithmic overhead is the cost of "separating" the two descriptions. It's the price you pay for not knowing in advance how long x's description is.

Note: C(x, y) can be much less than C(x) + C(y) when x and y share information. The extreme case: C(x, x) = C(x) + O(log C(x)), not 2C(x). The pair (x, x) contains no more information than x alone.

Check: Why does C(x, y) ≤ C(x) + C(y) + 2 log C(x) have a logarithmic overhead?

Chapter 3: Conditional Complexity

The conditional Kolmogorov complexity C(x|y) is the length of the shortest program that, given y as auxiliary input, outputs x. Formally:

C(x|y) = min{ |p| : D(p, y) = x }

Here D is now a conditional decompressor: a partial computable function of two arguments. The optimal conditional decompressor exists by the same universality argument.

Intuitively, C(x|y) measures how much additional information x contains beyond what's in y. If y = x, then C(x|y) = O(1) — given yourself, you need zero information. If y is independent of x, then C(x|y) ≈ C(x) — knowing y doesn't help.

Conditional complexity vs. unconditional: C(x|y) ≤ C(x) + O(1) always (you can ignore the hint y). The gap C(x) − C(x|y) measures how much information y provides about x.

All the basic properties carry over: C(x|y) is not computable, is upper semicomputable, satisfies the counting argument (for each y and n, fewer than 2n strings x have C(x|y) < n), and is robust under computable transformations.

Check: If C(x|y) = 5 and C(x) = 100, what does this tell us?

Chapter 4: Complexity as Information

We can now define the algorithmic mutual information between x and y:

I(x : y) = C(x) − C(x|y) = C(y) − C(y|x) + O(log n)

The second equality is the symmetry of information, a consequence of the Kolmogorov-Levin theorem. It says that x provides as much information about y as y provides about x (up to logarithmic terms).

This parallels Shannon's mutual information, but for individual strings rather than random variables. The analogy goes deep:

ShannonKolmogorov
H(X)C(x)
H(X|Y)C(x|y)
H(X, Y)C(x, y)
I(X; Y) = H(X) − H(X|Y)I(x : y) = C(x) − C(x|y)
H(X, Y) = H(X) + H(Y|X)C(x, y) = C(x) + C(y|x) + O(log n)

The chain rule for Shannon entropy holds exactly: H(X, Y) = H(X) + H(Y|X). For Kolmogorov complexity, it holds up to O(log n). The logarithmic error is the cost of encoding the "separator" between the two parts of the description.

Check: The algorithmic mutual information I(x : y) measures...

Chapter 5: The Chain Rule

The Kolmogorov-Levin theorem is the crown jewel of this chapter. Let's state it precisely.

Kolmogorov-Levin Theorem: For all strings x, y of length at most n:
C(x, y) = C(x) + C(y|x) + O(log n)

This says the shortest description of the pair (x, y) consists of three parts:

  1. A shortest description of x: C(x) bits.
  2. A shortest conditional description of y given x: C(y|x) bits.
  3. A "glue" overhead of O(log n) bits.

The upper bound (C(x, y) ≤ C(x) + C(y|x) + O(log n)) is straightforward: describe x, self-delimit its description, then describe y given x. The self-delimiting adds O(log n) bits.

The lower bound (C(x, y) ≥ C(x) + C(y|x) − O(log n)) is the hard part. It says you cannot do significantly better than this decomposition. We'll sketch the proof next.

Check: The chain rule C(x,y) = C(x) + C(y|x) + O(log n) — which direction is easy?

Chapter 6: Proof Sketch

The proof of the lower bound is the first genuinely clever argument in the theory. Here's the idea.

Let a = C(x, y). Consider the set A of all pairs (u, v) with C(u, v) ≤ a. By the counting argument, |A| ≤ 2a+1. Our pair (x, y) is one element of this set.

For each string t, define the section At = {v : (t, v) ∈ A}. Think of A as a 2D region and At as a vertical slice at position t.

Step 1
Let m = log|Ax|. The section at x has about 2m elements.
Step 2
C(y|x) ≤ m + O(log n). Given x and a, we can enumerate Ax. Then y's index in this enumeration takes m bits.
Step 3
C(x) ≤ (a − m) + O(log n). The set of t with |At| ≥ 2m has at most 2a+1/2m = 2a−m+1 elements. Given a and m, we can enumerate this set, so x's index takes a − m bits.

Adding the two bounds: C(x) + C(y|x) ≤ (a − m) + m + O(log n) = a + O(log n) = C(x, y) + O(log n). Done.

The key trick is the section argument: large sections mean y is easy to describe given x (few bits needed for the index), but x must be in a "popular" first coordinate, and there are few of those. Small sections mean x could be anywhere, but y is hard to specify. The total always adds up to a = C(x, y).

Combinatorial translation: The proof corresponds to a simple combinatorial fact: for a set A of pairs with |A| ≤ p · q, we can split A into P (where the projection onto the first coordinate has at most p elements) and Q (where every section has at most q elements). This is the Kolmogorov-Levin theorem in disguise.
Check: In the proof, what role does the "section" Ax play?

Chapter 7: Symmetry of Information

The chain rule gives us two decompositions:

C(x, y) = C(x) + C(y|x) + O(log n) = C(y) + C(x|y) + O(log n)

Subtracting, we get the symmetry of algorithmic information:

C(x) − C(x|y) = C(y) − C(y|x) + O(log n)

Or equivalently: I(x : y) = I(y : x) + O(log n). The amount of information x has about y equals the amount y has about x, up to O(log n).

Why this is remarkable: Imagine x is a 1-gigabyte genome and y is a 10-byte fingerprint extracted from it. The fingerprint y gives you 10 bytes of information about the genome — and the genome gives you exactly the same 10 bytes of information about the fingerprint. This is not obvious! The relationship is symmetric even when the objects are vastly different in size.

The O(log n) error term is unavoidable. One can construct pairs where C(x, y) − C(x) − C(y|x) is as large as log n, and pairs where it is as negative as −log n. The logarithmic gap is the price of not knowing the separator position in advance.

Check: Symmetry of information means...

Chapter 8: Chain Rule Visualizer

The simulation below illustrates the chain rule decomposition. For a pair (x, y), it shows how the total information C(x, y) splits into C(x) (information about x alone) and C(y|x) (the additional information in y beyond x).

Chain Rule Decomposition

Choose a relationship between x and y. The diagram shows how C(x,y) decomposes into C(x) + C(y|x) + O(log n). The overlap region represents mutual information.

What to notice: When x and y are independent, the circles barely overlap — C(x,y) ≈ C(x) + C(y). When they're identical, the circles fully overlap — C(x,y) ≈ C(x). The correlated case is in between. When y = f(x) for a computable f, C(y|x) = O(1), so C(x,y) ≈ C(x).
Check: If y = f(x) for a known computable f, what is C(y|x)?

Chapter 9: Summary

ResultStatement
Pair encoding[x, y] encodes a pair; overhead is 2 log|x| + O(1)
Pair upper boundC(x,y) ≤ C(x) + C(y) + 2 log C(x) + O(1)
Conditional complexityC(x|y) = min{|p| : D(p,y) = x}
Chain ruleC(x,y) = C(x) + C(y|x) + O(log n)
SymmetryC(x) − C(x|y) = C(y) − C(y|x) + O(log n)
Mutual informationI(x:y) = C(x) − C(x|y), and I(x:y) = I(y:x) + O(log n)
Looking ahead: Chapter 3 tackles the connection between complexity and randomness head-on. Martin-Löf's definition of randomness uses measure theory and computability to formalize "no pattern." The punchline: an infinite sequence is random iff its prefixes are incompressible — but making this precise requires prefix complexity (Chapter 4).
Final check: The O(log n) error in the chain rule is unavoidable because...