Introduction: What Is This Book About?

Chapter 0: The Question

Consider two binary strings, each 1000 bits long:

String A: 000000000000000000000000...0000 (all zeros)

String B: 011010110001110100101001...1101 (looks random)

Which one contains more information? Intuitively, String A is boring — you can describe it in a few words: "one thousand zeros." But String B? There seems to be no shortcut. To communicate it, you might have to spell out all 1000 bits.

This intuition is the seed of Kolmogorov complexity. The idea is simple and profound: the information content of a string is the length of its shortest description. A string that can be compressed a lot has low complexity. A string that cannot be compressed at all — where the shortest description is essentially the string itself — has high complexity.

The core idea: Kolmogorov complexity measures how much information a string contains by asking: what is the shortest program that produces this string? If the program is much shorter than the string, the string is compressible — it has patterns, structure, redundancy. If no program is shorter than the string itself, the string is incompressible — it is maximally complex, and in a precise sense, random.

This idea was independently developed in the 1960s by Ray Solomonoff (1960), Andrei Kolmogorov (1965), and Gregory Chaitin (1966). Kolmogorov's name stuck because his formulation was the most mathematically precise and because he connected it to the deepest questions in probability and randomness.

Before we can make "shortest program" precise, we need to fix a programming language — after all, the length of a program depends on the language. This is the first problem we'll solve.

Check: A string of 10,000 zeros has Kolmogorov complexity roughly...

10,000 bits (you must list every bit) Much less than 10,000 (a short program can print it) Exactly zero (it contains no information)

Chapter 1: Description Modes

To talk about "descriptions" formally, we introduce a description mode (also called a decompressor). This is simply a function D that takes a binary string as input and produces another binary string as output. If D(y) = x, we say that y is a description of x with respect to D.

Think of D as an interpreter. You hand it a compressed representation y, and it expands it into the original string x. A zip program is one description mode. A Python interpreter is another. Each one defines its own notion of "description."

C_D(x) = min{ |y| : D(y) = x }

The complexity of x with respect to D is the length of the shortest y that D maps to x. If no such y exists (D never outputs x), we say the complexity is infinite.

Different description modes give different complexities. Under one mode, the string "00000000" might have a very short description; under another, it might not. This seems like a fatal flaw — how can we define the complexity of a string if it depends on the language?

Kolmogorov's insight was that this dependence is only superficial. While absolute complexity values differ, they differ by at most a constant — a constant that depends on the two languages but not on the string. Let's see why.

Key insight: A description mode D₁ is "not worse" than D₂ if there is a constant c such that C_D₁(x) ≤ C_D₂(x) + c for all x. The constant c is the overhead of "translating" from one language to another.

Suppose D₁ is a Python interpreter and D₂ is a C interpreter. Given any C program y that outputs x, we can write a fixed Python program P (a C-to-Python translator) of some fixed length |P|. Then the Python description of x is at most |y| + |P| bits long. The extra |P| bits are the translation overhead — constant across all strings x.

Check: If D₁ is not worse than D₂, what does that mean?

C_D₁(x) ≤ C_D₂(x) + c for some constant c, for all x D₁ runs faster than D₂ D₁ and D₂ produce the same outputs

Chapter 2: The Optimal Description Mode

Here is the crucial theorem, due independently to Solomonoff and Kolmogorov:

Solomonoff-Kolmogorov Theorem: There exists an optimal description mode D₀ — one that is not worse than every other description mode. That is, for any description mode D, there is a constant c_D such that C_D₀(x) ≤ C_D(x) + c_D for all x.

How can one description mode beat them all? The trick is to use a universal programming language. Let U be an interpreter for a universal language. Given a pair (p, y), U runs program p on input y and returns the output. Now define:

D₀(⟨p, y⟩) = U(p, y)

Here ⟨p, y⟩ is some encoding that lets us recover both p and y from a single binary string. For any other description mode D', there is a program p_D' that implements D'. So any description y of x under D' gives a description ⟨p_D', y⟩ under D₀. The overhead is |p_D'| plus the encoding overhead — a constant that depends on D' but not on x.

Think of D₀ as a meta-interpreter. You prefix any description with the name of its language, and D₀ can handle it. This makes D₀ at least as powerful as any single language, up to a constant.

The analogy with self-extracting archives is exact: a self-extracting archive contains a small decompressor program followed by the compressed data. The decompressor is the "language identifier," and the data is the description. The total size is the data length plus a fixed overhead for the decompressor.

Any description mode D'

Compresses x using its own language

↓ prefix with program for D'

Universal mode D₀

⟨p_D', y⟩ — self-extracting archive

↓ overhead is |p_D'| = constant

Guarantee

C_D₀(x) ≤ C_D'(x) + c

Check: Why does the optimal description mode work for all description modes at once?

Because it runs faster than all of them Because it can simulate any mode by prefixing with the mode's program Because it uses a more efficient compression algorithm

Chapter 3: Defining Kolmogorov Complexity

Fix an optimal description mode D. The Kolmogorov complexity of a string x is:

C(x) = C_D(x) = min{ |y| : D(y) = x }

We drop the subscript D because the choice of optimal mode doesn't matter — switching to a different optimal mode changes C(x) by at most a constant. All statements about Kolmogorov complexity are therefore understood "up to an additive O(1)."

Two immediate consequences:

Theorem (Upper bound): C(x) ≤ |x| + O(1). The identity function x ↦ x is a valid description mode, so the shortest description is never longer than the string itself (up to a constant).

Theorem (Non-growth): For any computable function A, C(A(x)) ≤ C(x) + O(1). Applying a computable transformation cannot increase complexity by more than a constant. The constant depends on A but not on x.

The non-growth theorem is powerful. It says that complexity is robust: any computable manipulation — reversing, sorting, encrypting, hashing — can only decrease complexity (up to O(1)). If a string has low complexity, anything you compute from it also has low complexity.

A crucial caveat: the O(1) constant hides the dependence on the optimal description mode. You can talk about C(x) for classes of strings and make asymptotic statements, but asking "what is C(x) for this specific string?" doesn't have a unique answer — it depends on the choice of D.

Statement	Formal meaning
C(x) ≤ n + O(1)	There exists a constant c (depending on D) such that C(x) ≤ n + c for all x
C(x) = n + O(1)	\|C(x) − n\| ≤ c for some constant c
C(A(x)) ≤ C(x) + O(1)	The constant depends on A but not on x

Check: Can applying a computable function to a string increase its Kolmogorov complexity?

No, the increase is bounded by a constant that depends only on the function Yes, encryption can make any string maximally complex Only if the function is not injective

Chapter 4: Complexity and Information

The Kolmogorov complexity C(x) measures the amount of information in x. A string of zeros has little information — a short program generates it. A "random-looking" string that resists compression has a lot of information, even if that information is meaningless gibberish.

This is a departure from Shannon's information theory. Shannon measures the information in a message drawn from a known distribution. Kolmogorov measures the information in a single, individual string, with no probabilistic assumptions.

Key distinction: Shannon entropy tells you: "On average, messages from this source need H bits." Kolmogorov complexity tells you: "This specific string x needs C(x) bits to describe." Shannon is about ensembles; Kolmogorov is about individuals.

How many strings of a given length are "simple" (have low complexity)? The counting argument gives a clean answer:

Counting argument: There are fewer than 2ⁿ binary strings of length less than n. So at most 2ⁿ strings can have descriptions shorter than n bits. This means: among all strings of length n, at most a fraction 2ⁿ/2ⁿ = 2ⁿ⁻ⁿ can have complexity less than n. For any k, at most a fraction 2^−k of n-bit strings have C(x) < n − k.

In other words, most strings are incompressible. At least half of all n-bit strings have C(x) ≥ n. At least 99% have C(x) ≥ n − 7. Compressible strings are rare — they are the ones with patterns, structure, and regularity.

This is the foundation of the connection between complexity and randomness. An incompressible string has no patterns that a program could exploit — it "looks random." We will formalize this in Chapter 3 of the book (Martin-Löf randomness), where we show that a sequence is random if and only if its prefixes are incompressible.

Incompressibility Landscape

Each bar represents binary strings of a given length n. The shaded region shows what fraction can be compressed by at least k bits. Drag the slider to change k.

k (compression)3

Check: Among all binary strings of length 100, at most what fraction has C(x) < 90?

At most 2⁻¹⁰ ≈ 1/1024 At most 1/2 All of them

Chapter 5: Non-Computability and Berry's Paradox

Here is perhaps the most striking fact about Kolmogorov complexity: C(x) is not computable. No algorithm can take a string x as input and return C(x). Let's see why.

Suppose, for contradiction, that we had a program COMPLEXITY that computes C(x). Then we could write another program:

pseudocode
def berry(n):
    for x in all_binary_strings_by_length():
        if COMPLEXITY(x) >= n:
            return x

This program finds the first string whose complexity is at least n, and returns it. The program itself has a fixed size plus O(log n) bits to encode the parameter n. So the output x has a description of length O(log n). But we chose x so that C(x) ≥ n. For large enough n, we get a contradiction: C(x) ≥ n but also C(x) ≤ O(log n).

Berry's Paradox (formalized): "The smallest number not describable in fewer than twenty words" — but this sentence itself is a description in fewer than twenty words. The paradox is real: attempting to compute C(x) leads to a genuine mathematical contradiction. Therefore C(x) is not computable.

C(x) is, however, upper semicomputable (or "enumerable from above"). You can write a program that produces a sequence of upper bounds on C(x) that converge from above to the true value. The idea: run all possible programs in parallel. Each time one halts and produces x, you have a new upper bound on C(x). Over time, you find shorter and shorter descriptions. You just never know when you've found the shortest.

This asymmetry is fundamental: you can verify that a string has low complexity (by exhibiting a short description), but you cannot prove in general that a string has high complexity. In fact, no consistent formal system can prove "C(x) ≥ n" for n larger than the system's own complexity.

Check: Why is Kolmogorov complexity not computable?

Because strings can be infinitely long Because if it were, we could build a short program producing a string of high complexity — a contradiction Because there are uncountably many description modes

Chapter 6: First Applications

Despite (or because of) its non-computability, Kolmogorov complexity is a powerful proof tool. The incompressibility method works like this: take an object (a string, a graph, a number), assume it is incompressible, and derive consequences. Since most objects are incompressible, the conclusions hold for most objects.

The incompressibility method: To prove that "most n-bit strings have property P," show that any string without property P can be described in fewer than n bits. Since at most 2ⁿ⁻¹ strings have descriptions shorter than n bits, most strings must have property P.

Application 1: Infinitely many primes. Suppose there were only k primes p₁, ..., p_k. Then every number n ≤ 2^m can be written as p₁^a₁ · p₂^a₂ ·...· p_k^a_k, where each exponent a_i ≤ m. So n is described by the tuple (a₁, ..., a_k), which takes at most k · log m bits. But some n requires m bits to describe. For large m, k · log m < m, a contradiction.

Application 2: Strings with no short period. A string x has period p if x[i] = x[i+p] for all valid i. If x has period p, then x is determined by its first p characters plus n and p — roughly p + 2 log n bits. An incompressible string of length n has no period shorter than about n − 2 log n.

Compression Experiment

Generate random or patterned strings and see how much they compress. The bar chart compares original length vs. description length.

Check: What does the incompressibility method let you prove?

That most objects have a given property, by showing objects without it are compressible That all objects can be compressed That C(x) is computable for some strings

Chapter 7: Summary

Let's consolidate the big ideas from this introduction before diving into the technical chapters.

Concept	What it means
Description mode	A computable function D: binary strings → binary strings. D(y) = x means y describes x.
C_D(x)	Length of the shortest y such that D(y) = x.
Optimal mode	A universal D₀ that is not worse than any other mode. Exists by the Solomonoff-Kolmogorov theorem.
C(x)	Kolmogorov complexity: C_D₀(x), defined up to O(1).
C(x) ≤ \|x\| + O(1)	Complexity never exceeds string length (much).
Non-growth	C(A(x)) ≤ C(x) + O(1) for any computable A.
Counting	Fewer than 2ⁿ strings have C(x) < n. Most strings are incompressible.
Non-computability	C(x) cannot be computed, but can be approximated from above.

The road ahead: Chapter 1 formalizes plain complexity and explores its algorithmic properties. Chapter 2 extends to pairs and conditional complexity. Chapter 3 connects complexity to randomness via Martin-Löf's definition. Chapters 4–5 introduce prefix and monotone complexity — refined versions needed for the deepest theorems. And Chapter 7 bridges back to Shannon entropy.

The essential message: Kolmogorov complexity is a universal measure of information content for individual strings. It is independent of the programming language (up to O(1)), non-computable, and intimately connected to randomness. Every chapter ahead builds on these foundations.

Final check: Two different optimal description modes D₁ and D₂ give complexities that differ by...

At most a constant (depending on D₁ and D₂, not on x) At most O(log |x|) An arbitrary amount

What Is Kolmogorov Complexity?