Ch 2: Pairs & Conditional Complexity — Kolmogorov Complexity

Chapter 0: The Problem

You have two strings, x and y. How much information does the pair (x, y) contain in total? And how much does knowing x help you describe y?

Consider two scenarios:

Independent strings: x is a random 100-bit string. y is another random 100-bit string, generated independently. Knowing x tells you nothing about y. You'd expect the pair to need about 200 bits to describe.
Correlated strings: x is a random 100-bit string, and y = x (the same string). Now the pair is fully determined by x alone — about 100 bits should suffice.

Kolmogorov complexity handles this elegantly. The complexity of the pair, C(x, y), captures the total information. The conditional complexity C(y|x) — the shortest description of y when x is given for free — captures how much new information y adds beyond x.

The chain rule (preview): C(x, y) = C(x) + C(y|x) + O(log n). The information in a pair equals the information in x plus the information in y that isn't already in x. This is the Kolmogorov-Levin theorem, the first deep result in the theory.

Check: If x and y are independent random strings of length n, roughly how large is C(x, y)?

About 2n (you need to describe both) About n (knowing one determines the other) About n/2

Chapter 1: Encoding Pairs

To define C(x, y), we first need a computable way to encode a pair of strings as a single string. We need a computable bijection [·, ·]: Σ* × Σ* → Σ*.

One natural approach: encode the length of x, then x, then y. But how do you encode the length? You need a self-delimiting encoding — one where the decoder knows where the length ends and the data begins.

Standard encoding: Write x̅ = the bits of x with each bit doubled (0 → 00, 1 → 11), followed by 01 as a terminator. Then [x, y] = x̅y. The overhead is |x̅| − |x| = |x| + 2 bits, so |[x, y]| = 2|x| + 2 + |y|.

This encoding is inefficient for long x. A better approach: use x̅̅ where you self-delimit the length of x instead of x itself.

[x, y] = l(x)̅ · x · y

Here l(x) is the binary representation of |x|, and l(x)̅ is its self-delimiting version. Since |l(x)| = log |x| + O(1), the overhead is 2 log |x| + O(1). So |[x, y]| = |x| + |y| + 2 log |x| + O(1).

We can push the overhead even further down using iterated length encoding: encode the length of the length of x, etc. This gives |[x, y]| = |x| + |y| + log |x| + 2 log log |x| + O(1). But the 2 log |x| version is standard and sufficient for most purposes.

Check: Why can't we just concatenate x and y to encode the pair?

Because we couldn't tell where x ends and y begins Because the result would be too long Because concatenation is not computable

Chapter 2: Pair Complexity

The Kolmogorov complexity of a pair is defined as C(x, y) = C([x, y]) — the complexity of the encoding of the pair. This is well-defined up to O(1) regardless of which specific encoding we use, since all computable bijections differ by O(1).

Theorem 16 (Upper bound on pairs): C(x, y) ≤ C(x) + C(y) + 2 log C(x) + O(1).

Proof: Let p be a shortest description of x and q a shortest description of y. Then p̅q (self-delimiting p followed by q) is a description of the pair. Its length is |p| + |q| + 2 log |p| + O(1) = C(x) + C(y) + 2 log C(x) + O(1).

The logarithmic overhead is the cost of "separating" the two descriptions. It's the price you pay for not knowing in advance how long x's description is.

Note: C(x, y) can be much less than C(x) + C(y) when x and y share information. The extreme case: C(x, x) = C(x) + O(log C(x)), not 2C(x). The pair (x, x) contains no more information than x alone.

Check: Why does C(x, y) ≤ C(x) + C(y) + 2 log C(x) have a logarithmic overhead?

Because we need to self-delimit one description to know where it ends Because compression is lossy Because of rounding in binary representation

Chapter 3: Conditional Complexity

The conditional Kolmogorov complexity C(x|y) is the length of the shortest program that, given y as auxiliary input, outputs x. Formally:

C(x|y) = min{ |p| : D(p, y) = x }

Here D is now a conditional decompressor: a partial computable function of two arguments. The optimal conditional decompressor exists by the same universality argument.

Intuitively, C(x|y) measures how much additional information x contains beyond what's in y. If y = x, then C(x|y) = O(1) — given yourself, you need zero information. If y is independent of x, then C(x|y) ≈ C(x) — knowing y doesn't help.

Conditional complexity vs. unconditional: C(x|y) ≤ C(x) + O(1) always (you can ignore the hint y). The gap C(x) − C(x|y) measures how much information y provides about x.

All the basic properties carry over: C(x|y) is not computable, is upper semicomputable, satisfies the counting argument (for each y and n, fewer than 2ⁿ strings x have C(x|y) < n), and is robust under computable transformations.

Check: If C(x|y) = 5 and C(x) = 100, what does this tell us?

y contains almost all the information needed to describe x x and y are independent x is a very short string

Chapter 4: Complexity as Information

We can now define the algorithmic mutual information between x and y:

I(x : y) = C(x) − C(x|y) = C(y) − C(y|x) + O(log n)

The second equality is the symmetry of information, a consequence of the Kolmogorov-Levin theorem. It says that x provides as much information about y as y provides about x (up to logarithmic terms).

This parallels Shannon's mutual information, but for individual strings rather than random variables. The analogy goes deep:

Shannon	Kolmogorov
H(X)	C(x)
H(X\|Y)	C(x\|y)
H(X, Y)	C(x, y)
I(X; Y) = H(X) − H(X\|Y)	I(x : y) = C(x) − C(x\|y)
H(X, Y) = H(X) + H(Y\|X)	C(x, y) = C(x) + C(y\|x) + O(log n)

The chain rule for Shannon entropy holds exactly: H(X, Y) = H(X) + H(Y|X). For Kolmogorov complexity, it holds up to O(log n). The logarithmic error is the cost of encoding the "separator" between the two parts of the description.

Check: The algorithmic mutual information I(x : y) measures...

How much C(x) decreases when y is given as a hint The length of the longer string The edit distance between x and y

Chapter 5: The Chain Rule

The Kolmogorov-Levin theorem is the crown jewel of this chapter. Let's state it precisely.

Kolmogorov-Levin Theorem: For all strings x, y of length at most n:
C(x, y) = C(x) + C(y|x) + O(log n)

This says the shortest description of the pair (x, y) consists of three parts:

A shortest description of x: C(x) bits.
A shortest conditional description of y given x: C(y|x) bits.
A "glue" overhead of O(log n) bits.

The upper bound (C(x, y) ≤ C(x) + C(y|x) + O(log n)) is straightforward: describe x, self-delimit its description, then describe y given x. The self-delimiting adds O(log n) bits.

The lower bound (C(x, y) ≥ C(x) + C(y|x) − O(log n)) is the hard part. It says you cannot do significantly better than this decomposition. We'll sketch the proof next.

Check: The chain rule C(x,y) = C(x) + C(y|x) + O(log n) — which direction is easy?

The upper bound (≤): concatenate descriptions with a self-delimiting separator The lower bound (≥): you always need at least C(x) + C(y|x) bits Both directions are equally hard

Chapter 6: Proof Sketch

The proof of the lower bound is the first genuinely clever argument in the theory. Here's the idea.

Let a = C(x, y). Consider the set A of all pairs (u, v) with C(u, v) ≤ a. By the counting argument, |A| ≤ 2^a+1. Our pair (x, y) is one element of this set.

For each string t, define the section A_t = {v : (t, v) ∈ A}. Think of A as a 2D region and A_t as a vertical slice at position t.

Step 1

Let m = log|A_x|. The section at x has about 2^m elements.

↓

Step 2

C(y|x) ≤ m + O(log n). Given x and a, we can enumerate A_x. Then y's index in this enumeration takes m bits.

↓

Step 3

C(x) ≤ (a − m) + O(log n). The set of t with |A_t| ≥ 2^m has at most 2^a+1/2^m = 2^a−m+1 elements. Given a and m, we can enumerate this set, so x's index takes a − m bits.

Adding the two bounds: C(x) + C(y|x) ≤ (a − m) + m + O(log n) = a + O(log n) = C(x, y) + O(log n). Done.

The key trick is the section argument: large sections mean y is easy to describe given x (few bits needed for the index), but x must be in a "popular" first coordinate, and there are few of those. Small sections mean x could be anywhere, but y is hard to specify. The total always adds up to a = C(x, y).

Combinatorial translation: The proof corresponds to a simple combinatorial fact: for a set A of pairs with |A| ≤ p · q, we can split A into P (where the projection onto the first coordinate has at most p elements) and Q (where every section has at most q elements). This is the Kolmogorov-Levin theorem in disguise.

Check: In the proof, what role does the "section" A_x play?

Its size determines how many bits are needed to specify y given x It counts how many total pairs exist It determines whether x is compressible

Chapter 7: Symmetry of Information

The chain rule gives us two decompositions:

C(x, y) = C(x) + C(y|x) + O(log n) = C(y) + C(x|y) + O(log n)

Subtracting, we get the symmetry of algorithmic information:

C(x) − C(x|y) = C(y) − C(y|x) + O(log n)

Or equivalently: I(x : y) = I(y : x) + O(log n). The amount of information x has about y equals the amount y has about x, up to O(log n).

Why this is remarkable: Imagine x is a 1-gigabyte genome and y is a 10-byte fingerprint extracted from it. The fingerprint y gives you 10 bytes of information about the genome — and the genome gives you exactly the same 10 bytes of information about the fingerprint. This is not obvious! The relationship is symmetric even when the objects are vastly different in size.

The O(log n) error term is unavoidable. One can construct pairs where C(x, y) − C(x) − C(y|x) is as large as log n, and pairs where it is as negative as −log n. The logarithmic gap is the price of not knowing the separator position in advance.

Check: Symmetry of information means...

C(x) = C(y) for all x, y C(x) − C(x|y) ≈ C(y) − C(y|x): x tells you as much about y as y tells you about x C(x, y) = C(y, x) exactly

Chapter 8: Chain Rule Visualizer

The simulation below illustrates the chain rule decomposition. For a pair (x, y), it shows how the total information C(x, y) splits into C(x) (information about x alone) and C(y|x) (the additional information in y beyond x).

Chain Rule Decomposition

Choose a relationship between x and y. The diagram shows how C(x,y) decomposes into C(x) + C(y|x) + O(log n). The overlap region represents mutual information.

What to notice: When x and y are independent, the circles barely overlap — C(x,y) ≈ C(x) + C(y). When they're identical, the circles fully overlap — C(x,y) ≈ C(x). The correlated case is in between. When y = f(x) for a computable f, C(y|x) = O(1), so C(x,y) ≈ C(x).

Check: If y = f(x) for a known computable f, what is C(y|x)?

O(1) — given x, just apply f C(f) — you need to describe f C(y) — knowing x doesn't help

Chapter 9: Summary

Result	Statement
Pair encoding	[x, y] encodes a pair; overhead is 2 log\|x\| + O(1)
Pair upper bound	C(x,y) ≤ C(x) + C(y) + 2 log C(x) + O(1)
Conditional complexity	C(x\|y) = min{\|p\| : D(p,y) = x}
Chain rule	C(x,y) = C(x) + C(y\|x) + O(log n)
Symmetry	C(x) − C(x\|y) = C(y) − C(y\|x) + O(log n)
Mutual information	I(x:y) = C(x) − C(x\|y), and I(x:y) = I(y:x) + O(log n)

Looking ahead: Chapter 3 tackles the connection between complexity and randomness head-on. Martin-Löf's definition of randomness uses measure theory and computability to formalize "no pattern." The punchline: an infinite sequence is random iff its prefixes are incompressible — but making this precise requires prefix complexity (Chapter 4).

Final check: The O(log n) error in the chain rule is unavoidable because...

Compression is inherently lossy You need O(log n) bits to encode the separator between the two descriptions C(x) is not computable