Shen, Uspensky & Vereshchagin — Chapter 4

A Priori Probability &
Prefix Complexity

Prefix-free codes, semimeasures, and the refined complexity measure K(x) that connects to randomness and probability.

Prerequisites: Chapters 1–3 (plain complexity, conditional complexity, Martin-Löf randomness).

Chapters

Simulations

Quizzes

Chapter 0: Why Prefix Complexity?

Plain complexity C(x) has a flaw: it doesn't behave well with probability. We'd like to say that "the probability that a random program outputs x" is related to 2^−C(x). But this fails because plain descriptions can overlap — a program of length 3 that outputs "hello" and a program of length 5 that outputs "hello" aren't related in any nice probabilistic way.

The fix is to require descriptions to be prefix-free: no description is a prefix of another. This is exactly the constraint that makes Kraft's inequality work, and it connects complexity to probability.

The key constraint: In a prefix-free code, the set of valid descriptions forms an antichain in the binary tree — no codeword is a prefix of another. This means ∑ 2^−|d| ≤ 1 (Kraft's inequality), which lets us interpret 2^−K(x) as a probability.

With plain complexity, ∑_x 2^−C(x) can be infinite (many strings can share the same description length). With prefix complexity K(x), the sum ∑_x 2^−K(x) is always at most 1 — it defines a semimeasure on strings. This connection to probability is why prefix complexity is the "right" version for randomness.

Check: The prefix-free constraint on descriptions ensures...

∑ 2^−K(x) ≤ 1, giving K(x) a probabilistic interpretation Descriptions are always shorter than the string K(x) is computable

Chapter 1: Semimeasures on N

A semimeasure on the natural numbers is a function m: N → [0, ∞) such that ∑_n m(n) ≤ 1. It's like a probability distribution, except the total mass may be less than 1 (some probability "leaks out").

Where do semimeasures come from? Consider a randomized algorithm: a program that flips coins as it runs. If the program always halts, the probability of each output n defines a measure. If it might not halt, some probability is "lost" to non-termination — and we get a semimeasure.

Randomized algorithm → semimeasure: Feed random bits to a program. With probability m(n), the program halts and outputs n. With probability 1 − ∑ m(n), it runs forever. The function m is a lower semicomputable semimeasure — we can approximate m(n) from below by running more and more random inputs.

A semimeasure m is lower semicomputable (or enumerable from below) if there is a computable sequence m₀(n) ≤ m₁(n) ≤ ... converging to m(n). We can approximate each m(n) from below but never know the exact value.

Check: A semimeasure differs from a probability measure in that...

Its total mass may be less than 1 It can take negative values It is defined only on finite sets

Chapter 2: Maximal Semimeasures

Just as there is an optimal description mode for complexity, there is a maximal lower semicomputable semimeasure. Call it M.

Theorem: There exists a lower semicomputable semimeasure M such that for any other lower semicomputable semimeasure m, there is a constant c with M(n) ≥ c · m(n) for all n.

Construction: Enumerate all programs p₁, p₂, .... Let m_i be the semimeasure generated by program p_i (with random input). Define M(n) = ∑_i 2⁻ⁱ · m_i(n). This is a weighted average over all lower semicomputable semimeasures. For any specific m_j, M(n) ≥ 2^−j · m_j(n).

The function M is called the a priori probability or Solomonoff's prior. M(n) is the probability that a universal randomized algorithm outputs n. Strings with high M(n) are those that "many programs" produce — simple strings. Strings with low M(n) are rare outputs — complex strings.

The fundamental connection: −log M(x) = K(x) + O(1). The negative log of the a priori probability equals the prefix complexity. Simple strings have high prior probability; complex strings have low prior probability. This is the bridge between information and probability.

Check: The a priori probability M(x) is highest for strings that are...

Simple (many programs output them) Long Random

Chapter 3: Prefix Machines

A prefix machine is a partial computable function D: Σ* → Σ* whose domain is prefix-free: if D(p) is defined and D(q) is defined, then neither p nor q is a prefix of the other.

Visually, think of the domain of D as a set of leaves in the infinite binary tree. Each leaf (valid input) maps to an output string. No input is an "ancestor" of another input in the tree.

Kraft's inequality for prefix machines: If D is a prefix machine, then ∑_{p ∈ dom(D)} 2^−|p| ≤ 1. This is because the domain forms an antichain in the binary tree, and the "measure" of an antichain cannot exceed 1.

The prefix complexity of x with respect to a prefix machine D is:

K_D(x) = min{ |p| : D(p) = x, p ∈ dom(D) }

An optimal prefix machine exists by the same universality argument. Fix one and define K(x) = K_D(x). Like C(x), K(x) is defined up to O(1) and is not computable.

Check: In a prefix machine, no valid input is...

A prefix of another valid input Longer than the output A binary string

Chapter 4: Self-Delimiting Input

There's an equivalent way to think about prefix machines: a machine that reads its input bit by bit from a one-way tape and decides on its own when to stop reading. The machine never "looks ahead" — it processes bits one at a time and halts after reading some prefix of the tape.

Analogy: A plain machine is like a file that needs a header specifying its length. A prefix (self-delimiting) machine is like a streaming protocol that knows when the message is complete. The message carries its own length information embedded in its content.

This model makes the prefix-free condition natural. If the machine halts after reading bits p, it never reads any extension of p. So the set of inputs on which the machine halts forms a prefix-free set automatically.

The two models (prefix-free domain vs. self-delimiting reader) are equivalent. Both give the same complexity K(x) up to O(1).

Check: A self-delimiting machine determines when to stop reading by...

Reading a length header at the start Processing bits one at a time and halting when the computation is done Reaching the end of the tape

Chapter 5: The Main Theorem on Prefix Complexity

The connection between prefix complexity and a priori probability:

Main Theorem: K(x) = −log M(x) + O(1), where M is the maximal lower semicomputable semimeasure (a priori probability).

This theorem has two parts:

K(x) ≤ −log M(x) + O(1): Consider the optimal prefix machine D. Define m(x) = ∑ 2^−|p| over all p with D(p) = x. By Kraft's inequality, m is a semimeasure. Its value m(x) = 2^−K(x) (up to a constant factor). Since M dominates all semicomputable semimeasures, M(x) ≥ c · m(x) = c · 2^−K(x), giving K(x) ≥ −log M(x) − O(1).

K(x) ≥ −log M(x) − O(1): Given the semimeasure M, construct a prefix machine whose description lengths approximate −log M(x). This uses the Kraft inequality in reverse: given a semimeasure (weights summing to at most 1), construct a prefix-free code with matching lengths.

Consequence: 2^−K(x) is (up to a multiplicative constant) the "probability of x under the universal prior." This makes K(x) the natural complexity for Bayesian reasoning and connects complexity theory to Solomonoff induction.

Check: The main theorem says K(x) equals...

−log M(x) + O(1), linking complexity to a priori probability C(x) + log C(x) The length of x

Chapter 6: Properties of K

Prefix complexity K(x) shares many properties with plain complexity C(x) but has important differences:

Property	C(x)	K(x)
Upper bound	C(x) ≤ \|x\| + O(1)	K(x) ≤ \|x\| + 2 log\|x\| + O(1)
Non-growth	C(f(x)) ≤ C(x) + O(1)	K(f(x)) ≤ K(x) + O(1)
Counting	\|{x: C(x) < n}\| < 2ⁿ	\|{x: K(x) < n}\| < 2ⁿ
Kraft	∑ 2^−C(x) can be ∞	∑ 2^−K(x) ≤ 1
Relation	C(x) ≤ K(x) ≤ C(x) + 2 log C(x) + O(1)

The key difference: K(x) ≤ |x| + 2 log|x| + O(1), not |x| + O(1). The extra 2 log|x| is the cost of self-delimiting: the machine must know when the input ends, and encoding the length costs log|x| + O(1) bits.

The subadditivity property is cleaner for K: K(x, y) ≤ K(x) + K(y) + O(1). No logarithmic overhead! This is because prefix descriptions are self-delimiting — you can concatenate them without a separator.

Chain rule for K: K(x, y) = K(x) + K(y|x*) + O(1), where x* is a shortest prefix description of x. The notation K(y|x*) means "conditional on a shortest description of x" (not on x itself). This version has O(1) error, not O(log n)!

Check: The relation between K and C is...

C(x) ≤ K(x) ≤ C(x) + 2 log C(x) + O(1) K(x) = C(x) exactly K(x) ≤ C(x) always

Chapter 7: K vs C — When Does It Matter?

For most applications, C(x) and K(x) give the same results up to logarithmic terms. But the distinction matters crucially for randomness.

Why K for randomness? The characterization of ML-randomness requires K, not C. A sequence ω is ML-random iff K(ω[0..n)) ≥ n − O(1) for all n. If you try to use C instead, you get: C(ω[0..n)) ≥ n − O(log n), which is weaker and doesn't characterize ML-randomness precisely.

The reason K works better: Kraft's inequality ∑ 2^−K(x) ≤ 1 means that 2^−K(x) defines a semimeasure. This lets us use K as a test for randomness against arbitrary computable measures. Plain complexity C doesn't have this property.

In the incompressibility method (proofs about combinatorics), C is usually sufficient because the logarithmic difference between C and K is absorbed into the O(log n) error terms.

Check: Why is K(x), not C(x), needed for the randomness characterization?

Because K satisfies Kraft's inequality, connecting it to probability measures Because K is computable Because C is always larger than K

Chapter 8: Prefix Tree Visualizer

The simulation below visualizes prefix-free codes as leaves of a binary tree. Each leaf represents a valid description. The prefix-free constraint means no leaf is an ancestor of another — once you reach a codeword, you stop.

Prefix-Free Code Tree

Click nodes to toggle them as codewords (leaves). The tree enforces the prefix-free constraint: selecting a node blocks its descendants. The Kraft sum is displayed below.

Kraft sum: 0

What to notice: The balanced code (all leaves at the same depth) has codewords of equal length. The unbalanced code trades short codewords for some strings against longer ones for others — this is optimal when some strings are more probable. Kraft's inequality (∑ 2^−|p| ≤ 1) is satisfied whenever no codeword is a prefix of another.

Check: If a prefix-free code has codewords of lengths 1, 2, 2, is Kraft's inequality satisfied?

Yes: 2⁻¹ + 2⁻² + 2⁻² = 1 ≤ 1 No: the sum exceeds 1 Only if the codewords are binary

Chapter 9: Summary

Concept	Definition/Result
Semimeasure	m: Σ* → [0,∞) with ∑ m(x) ≤ 1
A priori probability M	Maximal lower semicomputable semimeasure
Prefix machine	Computable D with prefix-free domain
K(x)	Prefix complexity: shortest prefix-free description
Main theorem	K(x) = −log M(x) + O(1)
Kraft	∑ 2^−K(x) ≤ 1
K vs C	C(x) ≤ K(x) ≤ C(x) + 2 log C(x) + O(1)
Subadditivity	K(x,y) ≤ K(x) + K(y) + O(1)

Looking ahead: Chapter 5 introduces monotone complexity and proves the Levin-Schnorr theorem — the precise characterization of ML-randomness in terms of prefix complexity: ω is ML-random iff K(ω[0..n)) ≥ n − O(1) for all n. It also introduces Ω, Chaitin's halting probability — a single real number that encodes the halting problem.

Final check: The a priori probability M(x) is large when x is...

Simple (low K(x)) Long Random

A Priori Probability &Prefix Complexity