The shortest program that produces a string tells you how much information the string really contains.
Consider two binary strings, each 1000 bits long:
String A: 000000000000000000000000...0000 (all zeros)
String B: 011010110001110100101001...1101 (looks random)
Which one contains more information? Intuitively, String A is boring — you can describe it in a few words: "one thousand zeros." But String B? There seems to be no shortcut. To communicate it, you might have to spell out all 1000 bits.
This intuition is the seed of Kolmogorov complexity. The idea is simple and profound: the information content of a string is the length of its shortest description. A string that can be compressed a lot has low complexity. A string that cannot be compressed at all — where the shortest description is essentially the string itself — has high complexity.
This idea was independently developed in the 1960s by Ray Solomonoff (1960), Andrei Kolmogorov (1965), and Gregory Chaitin (1966). Kolmogorov's name stuck because his formulation was the most mathematically precise and because he connected it to the deepest questions in probability and randomness.
Before we can make "shortest program" precise, we need to fix a programming language — after all, the length of a program depends on the language. This is the first problem we'll solve.
To talk about "descriptions" formally, we introduce a description mode (also called a decompressor). This is simply a function D that takes a binary string as input and produces another binary string as output. If D(y) = x, we say that y is a description of x with respect to D.
Think of D as an interpreter. You hand it a compressed representation y, and it expands it into the original string x. A zip program is one description mode. A Python interpreter is another. Each one defines its own notion of "description."
The complexity of x with respect to D is the length of the shortest y that D maps to x. If no such y exists (D never outputs x), we say the complexity is infinite.
Different description modes give different complexities. Under one mode, the string "00000000" might have a very short description; under another, it might not. This seems like a fatal flaw — how can we define the complexity of a string if it depends on the language?
Kolmogorov's insight was that this dependence is only superficial. While absolute complexity values differ, they differ by at most a constant — a constant that depends on the two languages but not on the string. Let's see why.
Suppose D1 is a Python interpreter and D2 is a C interpreter. Given any C program y that outputs x, we can write a fixed Python program P (a C-to-Python translator) of some fixed length |P|. Then the Python description of x is at most |y| + |P| bits long. The extra |P| bits are the translation overhead — constant across all strings x.
Here is the crucial theorem, due independently to Solomonoff and Kolmogorov:
How can one description mode beat them all? The trick is to use a universal programming language. Let U be an interpreter for a universal language. Given a pair (p, y), U runs program p on input y and returns the output. Now define:
Here ⟨p, y⟩ is some encoding that lets us recover both p and y from a single binary string. For any other description mode D', there is a program pD' that implements D'. So any description y of x under D' gives a description ⟨pD', y⟩ under D0. The overhead is |pD'| plus the encoding overhead — a constant that depends on D' but not on x.
Think of D0 as a meta-interpreter. You prefix any description with the name of its language, and D0 can handle it. This makes D0 at least as powerful as any single language, up to a constant.
The analogy with self-extracting archives is exact: a self-extracting archive contains a small decompressor program followed by the compressed data. The decompressor is the "language identifier," and the data is the description. The total size is the data length plus a fixed overhead for the decompressor.
Fix an optimal description mode D. The Kolmogorov complexity of a string x is:
We drop the subscript D because the choice of optimal mode doesn't matter — switching to a different optimal mode changes C(x) by at most a constant. All statements about Kolmogorov complexity are therefore understood "up to an additive O(1)."
Two immediate consequences:
The non-growth theorem is powerful. It says that complexity is robust: any computable manipulation — reversing, sorting, encrypting, hashing — can only decrease complexity (up to O(1)). If a string has low complexity, anything you compute from it also has low complexity.
A crucial caveat: the O(1) constant hides the dependence on the optimal description mode. You can talk about C(x) for classes of strings and make asymptotic statements, but asking "what is C(x) for this specific string?" doesn't have a unique answer — it depends on the choice of D.
| Statement | Formal meaning |
|---|---|
| C(x) ≤ n + O(1) | There exists a constant c (depending on D) such that C(x) ≤ n + c for all x |
| C(x) = n + O(1) | |C(x) − n| ≤ c for some constant c |
| C(A(x)) ≤ C(x) + O(1) | The constant depends on A but not on x |
The Kolmogorov complexity C(x) measures the amount of information in x. A string of zeros has little information — a short program generates it. A "random-looking" string that resists compression has a lot of information, even if that information is meaningless gibberish.
This is a departure from Shannon's information theory. Shannon measures the information in a message drawn from a known distribution. Kolmogorov measures the information in a single, individual string, with no probabilistic assumptions.
How many strings of a given length are "simple" (have low complexity)? The counting argument gives a clean answer:
In other words, most strings are incompressible. At least half of all n-bit strings have C(x) ≥ n. At least 99% have C(x) ≥ n − 7. Compressible strings are rare — they are the ones with patterns, structure, and regularity.
This is the foundation of the connection between complexity and randomness. An incompressible string has no patterns that a program could exploit — it "looks random." We will formalize this in Chapter 3 of the book (Martin-Löf randomness), where we show that a sequence is random if and only if its prefixes are incompressible.
Each bar represents binary strings of a given length n. The shaded region shows what fraction can be compressed by at least k bits. Drag the slider to change k.
Here is perhaps the most striking fact about Kolmogorov complexity: C(x) is not computable. No algorithm can take a string x as input and return C(x). Let's see why.
Suppose, for contradiction, that we had a program COMPLEXITY that computes C(x). Then we could write another program:
pseudocode def berry(n): for x in all_binary_strings_by_length(): if COMPLEXITY(x) >= n: return x
This program finds the first string whose complexity is at least n, and returns it. The program itself has a fixed size plus O(log n) bits to encode the parameter n. So the output x has a description of length O(log n). But we chose x so that C(x) ≥ n. For large enough n, we get a contradiction: C(x) ≥ n but also C(x) ≤ O(log n).
C(x) is, however, upper semicomputable (or "enumerable from above"). You can write a program that produces a sequence of upper bounds on C(x) that converge from above to the true value. The idea: run all possible programs in parallel. Each time one halts and produces x, you have a new upper bound on C(x). Over time, you find shorter and shorter descriptions. You just never know when you've found the shortest.
This asymmetry is fundamental: you can verify that a string has low complexity (by exhibiting a short description), but you cannot prove in general that a string has high complexity. In fact, no consistent formal system can prove "C(x) ≥ n" for n larger than the system's own complexity.
Despite (or because of) its non-computability, Kolmogorov complexity is a powerful proof tool. The incompressibility method works like this: take an object (a string, a graph, a number), assume it is incompressible, and derive consequences. Since most objects are incompressible, the conclusions hold for most objects.
Application 1: Infinitely many primes. Suppose there were only k primes p1, ..., pk. Then every number n ≤ 2m can be written as p1a1 · p2a2 ·...· pkak, where each exponent ai ≤ m. So n is described by the tuple (a1, ..., ak), which takes at most k · log m bits. But some n requires m bits to describe. For large m, k · log m < m, a contradiction.
Application 2: Strings with no short period. A string x has period p if x[i] = x[i+p] for all valid i. If x has period p, then x is determined by its first p characters plus n and p — roughly p + 2 log n bits. An incompressible string of length n has no period shorter than about n − 2 log n.
Generate random or patterned strings and see how much they compress. The bar chart compares original length vs. description length.
Let's consolidate the big ideas from this introduction before diving into the technical chapters.
| Concept | What it means |
|---|---|
| Description mode | A computable function D: binary strings → binary strings. D(y) = x means y describes x. |
| CD(x) | Length of the shortest y such that D(y) = x. |
| Optimal mode | A universal D0 that is not worse than any other mode. Exists by the Solomonoff-Kolmogorov theorem. |
| C(x) | Kolmogorov complexity: CD0(x), defined up to O(1). |
| C(x) ≤ |x| + O(1) | Complexity never exceeds string length (much). |
| Non-growth | C(A(x)) ≤ C(x) + O(1) for any computable A. |
| Counting | Fewer than 2n strings have C(x) < n. Most strings are incompressible. |
| Non-computability | C(x) cannot be computed, but can be approximated from above. |
The essential message: Kolmogorov complexity is a universal measure of information content for individual strings. It is independent of the programming language (up to O(1)), non-computable, and intimately connected to randomness. Every chapter ahead builds on these foundations.