Shen, Uspensky & Vereshchagin — Chapter 4

A Priori Probability &
Prefix Complexity

Prefix-free codes, semimeasures, and the refined complexity measure K(x) that connects to randomness and probability.

Prerequisites: Chapters 1–3 (plain complexity, conditional complexity, Martin-Löf randomness).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Why Prefix Complexity?

Plain complexity C(x) has a flaw: it doesn't behave well with probability. We'd like to say that "the probability that a random program outputs x" is related to 2−C(x). But this fails because plain descriptions can overlap — a program of length 3 that outputs "hello" and a program of length 5 that outputs "hello" aren't related in any nice probabilistic way.

The fix is to require descriptions to be prefix-free: no description is a prefix of another. This is exactly the constraint that makes Kraft's inequality work, and it connects complexity to probability.

The key constraint: In a prefix-free code, the set of valid descriptions forms an antichain in the binary tree — no codeword is a prefix of another. This means ∑ 2−|d| ≤ 1 (Kraft's inequality), which lets us interpret 2−K(x) as a probability.

With plain complexity, ∑x 2−C(x) can be infinite (many strings can share the same description length). With prefix complexity K(x), the sum ∑x 2−K(x) is always at most 1 — it defines a semimeasure on strings. This connection to probability is why prefix complexity is the "right" version for randomness.

Check: The prefix-free constraint on descriptions ensures...

Chapter 1: Semimeasures on N

A semimeasure on the natural numbers is a function m: N → [0, ∞) such that ∑n m(n) ≤ 1. It's like a probability distribution, except the total mass may be less than 1 (some probability "leaks out").

Where do semimeasures come from? Consider a randomized algorithm: a program that flips coins as it runs. If the program always halts, the probability of each output n defines a measure. If it might not halt, some probability is "lost" to non-termination — and we get a semimeasure.

Randomized algorithm → semimeasure: Feed random bits to a program. With probability m(n), the program halts and outputs n. With probability 1 − ∑ m(n), it runs forever. The function m is a lower semicomputable semimeasure — we can approximate m(n) from below by running more and more random inputs.

A semimeasure m is lower semicomputable (or enumerable from below) if there is a computable sequence m0(n) ≤ m1(n) ≤ ... converging to m(n). We can approximate each m(n) from below but never know the exact value.

Check: A semimeasure differs from a probability measure in that...

Chapter 2: Maximal Semimeasures

Just as there is an optimal description mode for complexity, there is a maximal lower semicomputable semimeasure. Call it M.

Theorem: There exists a lower semicomputable semimeasure M such that for any other lower semicomputable semimeasure m, there is a constant c with M(n) ≥ c · m(n) for all n.

Construction: Enumerate all programs p1, p2, .... Let mi be the semimeasure generated by program pi (with random input). Define M(n) = ∑i 2−i · mi(n). This is a weighted average over all lower semicomputable semimeasures. For any specific mj, M(n) ≥ 2−j · mj(n).

The function M is called the a priori probability or Solomonoff's prior. M(n) is the probability that a universal randomized algorithm outputs n. Strings with high M(n) are those that "many programs" produce — simple strings. Strings with low M(n) are rare outputs — complex strings.

The fundamental connection: −log M(x) = K(x) + O(1). The negative log of the a priori probability equals the prefix complexity. Simple strings have high prior probability; complex strings have low prior probability. This is the bridge between information and probability.
Check: The a priori probability M(x) is highest for strings that are...

Chapter 3: Prefix Machines

A prefix machine is a partial computable function D: Σ* → Σ* whose domain is prefix-free: if D(p) is defined and D(q) is defined, then neither p nor q is a prefix of the other.

Visually, think of the domain of D as a set of leaves in the infinite binary tree. Each leaf (valid input) maps to an output string. No input is an "ancestor" of another input in the tree.

Kraft's inequality for prefix machines: If D is a prefix machine, then ∑p ∈ dom(D) 2−|p| ≤ 1. This is because the domain forms an antichain in the binary tree, and the "measure" of an antichain cannot exceed 1.

The prefix complexity of x with respect to a prefix machine D is:

KD(x) = min{ |p| : D(p) = x, p ∈ dom(D) }

An optimal prefix machine exists by the same universality argument. Fix one and define K(x) = KD(x). Like C(x), K(x) is defined up to O(1) and is not computable.

Check: In a prefix machine, no valid input is...

Chapter 4: Self-Delimiting Input

There's an equivalent way to think about prefix machines: a machine that reads its input bit by bit from a one-way tape and decides on its own when to stop reading. The machine never "looks ahead" — it processes bits one at a time and halts after reading some prefix of the tape.

Analogy: A plain machine is like a file that needs a header specifying its length. A prefix (self-delimiting) machine is like a streaming protocol that knows when the message is complete. The message carries its own length information embedded in its content.

This model makes the prefix-free condition natural. If the machine halts after reading bits p, it never reads any extension of p. So the set of inputs on which the machine halts forms a prefix-free set automatically.

The two models (prefix-free domain vs. self-delimiting reader) are equivalent. Both give the same complexity K(x) up to O(1).

Check: A self-delimiting machine determines when to stop reading by...

Chapter 5: The Main Theorem on Prefix Complexity

The connection between prefix complexity and a priori probability:

Main Theorem: K(x) = −log M(x) + O(1), where M is the maximal lower semicomputable semimeasure (a priori probability).

This theorem has two parts:

K(x) ≤ −log M(x) + O(1): Consider the optimal prefix machine D. Define m(x) = ∑ 2−|p| over all p with D(p) = x. By Kraft's inequality, m is a semimeasure. Its value m(x) = 2−K(x) (up to a constant factor). Since M dominates all semicomputable semimeasures, M(x) ≥ c · m(x) = c · 2−K(x), giving K(x) ≥ −log M(x) − O(1).

K(x) ≥ −log M(x) − O(1): Given the semimeasure M, construct a prefix machine whose description lengths approximate −log M(x). This uses the Kraft inequality in reverse: given a semimeasure (weights summing to at most 1), construct a prefix-free code with matching lengths.

Consequence: 2−K(x) is (up to a multiplicative constant) the "probability of x under the universal prior." This makes K(x) the natural complexity for Bayesian reasoning and connects complexity theory to Solomonoff induction.
Check: The main theorem says K(x) equals...

Chapter 6: Properties of K

Prefix complexity K(x) shares many properties with plain complexity C(x) but has important differences:

PropertyC(x)K(x)
Upper boundC(x) ≤ |x| + O(1)K(x) ≤ |x| + 2 log|x| + O(1)
Non-growthC(f(x)) ≤ C(x) + O(1)K(f(x)) ≤ K(x) + O(1)
Counting|{x: C(x) < n}| < 2n|{x: K(x) < n}| < 2n
Kraft∑ 2−C(x) can be ∞∑ 2−K(x) ≤ 1
RelationC(x) ≤ K(x) ≤ C(x) + 2 log C(x) + O(1)

The key difference: K(x) ≤ |x| + 2 log|x| + O(1), not |x| + O(1). The extra 2 log|x| is the cost of self-delimiting: the machine must know when the input ends, and encoding the length costs log|x| + O(1) bits.

The subadditivity property is cleaner for K: K(x, y) ≤ K(x) + K(y) + O(1). No logarithmic overhead! This is because prefix descriptions are self-delimiting — you can concatenate them without a separator.

Chain rule for K: K(x, y) = K(x) + K(y|x*) + O(1), where x* is a shortest prefix description of x. The notation K(y|x*) means "conditional on a shortest description of x" (not on x itself). This version has O(1) error, not O(log n)!
Check: The relation between K and C is...

Chapter 7: K vs C — When Does It Matter?

For most applications, C(x) and K(x) give the same results up to logarithmic terms. But the distinction matters crucially for randomness.

Why K for randomness? The characterization of ML-randomness requires K, not C. A sequence ω is ML-random iff K(ω[0..n)) ≥ n − O(1) for all n. If you try to use C instead, you get: C(ω[0..n)) ≥ n − O(log n), which is weaker and doesn't characterize ML-randomness precisely.

The reason K works better: Kraft's inequality ∑ 2−K(x) ≤ 1 means that 2−K(x) defines a semimeasure. This lets us use K as a test for randomness against arbitrary computable measures. Plain complexity C doesn't have this property.

In the incompressibility method (proofs about combinatorics), C is usually sufficient because the logarithmic difference between C and K is absorbed into the O(log n) error terms.

Check: Why is K(x), not C(x), needed for the randomness characterization?

Chapter 8: Prefix Tree Visualizer

The simulation below visualizes prefix-free codes as leaves of a binary tree. Each leaf represents a valid description. The prefix-free constraint means no leaf is an ancestor of another — once you reach a codeword, you stop.

Prefix-Free Code Tree

Click nodes to toggle them as codewords (leaves). The tree enforces the prefix-free constraint: selecting a node blocks its descendants. The Kraft sum is displayed below.

Kraft sum: 0
What to notice: The balanced code (all leaves at the same depth) has codewords of equal length. The unbalanced code trades short codewords for some strings against longer ones for others — this is optimal when some strings are more probable. Kraft's inequality (∑ 2−|p| ≤ 1) is satisfied whenever no codeword is a prefix of another.
Check: If a prefix-free code has codewords of lengths 1, 2, 2, is Kraft's inequality satisfied?

Chapter 9: Summary

ConceptDefinition/Result
Semimeasurem: Σ* → [0,∞) with ∑ m(x) ≤ 1
A priori probability MMaximal lower semicomputable semimeasure
Prefix machineComputable D with prefix-free domain
K(x)Prefix complexity: shortest prefix-free description
Main theoremK(x) = −log M(x) + O(1)
Kraft∑ 2−K(x) ≤ 1
K vs CC(x) ≤ K(x) ≤ C(x) + 2 log C(x) + O(1)
SubadditivityK(x,y) ≤ K(x) + K(y) + O(1)
Looking ahead: Chapter 5 introduces monotone complexity and proves the Levin-Schnorr theorem — the precise characterization of ML-randomness in terms of prefix complexity: ω is ML-random iff K(ω[0..n)) ≥ n − O(1) for all n. It also introduces Ω, Chaitin's halting probability — a single real number that encodes the halting problem.
Final check: The a priori probability M(x) is large when x is...