The best model is the one that compresses your data the most. Not the one that fits it best — the one that describes it most concisely.
You have 20 data points from some unknown process. You fit a degree-1 polynomial (a line). It misses some points. You fit a degree-5 polynomial. It passes closer to the data. You fit a degree-19 polynomial. It passes through every single point perfectly. Zero error.
Which model should you trust?
Every working scientist knows the answer intuitively: the degree-19 polynomial is nonsense. It memorized the data rather than learning the pattern. It will make terrible predictions on new data. But why? The degree-19 model has the lowest error. By any "fit the data" criterion, it wins. What are we really doing when we reject it?
This is Occam's razor made operational. We all agree that simpler explanations are preferable. But "simpler" is vague. Simpler in what sense? Fewer parameters? Smoother curves? Shorter equations? MDL replaces this vague notion with a concrete, measurable one: the best model is the one that allows the shortest description of the data.
Here is the intuition. If you have found genuine regularity in the data, you can exploit that regularity to describe the data more concisely than just listing every data point. A good model is a good compressor. The best model is the best compressor.
Click Generate Data for noisy samples from a true degree-2 curve. Then toggle polynomial degree and watch the fit and description length. The minimum description length picks the right model.
In the simulation above, notice the pattern: as the degree increases from 1, the description length first decreases (the model captures real structure, so it compresses better), then increases (extra parameters cost bits to describe, but the fit gains are just memorizing noise). The minimum is at or near the true degree.
Here is the fundamental insight that MDL is built on, stated plainly:
Consider a binary sequence: 0101010101010101. That is 16 bits. But you can describe it much more concisely: "repeat 01 eight times." That short description works because there is a regularity — the alternating pattern. If you can describe the data in fewer bits than the raw data takes, you have found structure.
Now consider: 1001110100010110. Can you compress this? Not really. There is no obvious pattern to exploit. The shortest description is essentially the sequence itself. MDL would say: there is no learnable regularity here. The data is incompressible.
This maps directly to model selection. Suppose you have data D and two candidate models:
| Model M1 (simple) | Model M2 (complex) | |
|---|---|---|
| Cost to describe model | Short (few parameters) | Long (many parameters) |
| Cost to describe data given model | Moderate (imperfect fit) | Short (excellent fit) |
| Total description length | L(M1) + L(D|M1) | L(M2) + L(D|M2) |
MDL picks the model that minimizes the total description length: the cost of describing the model itself, plus the cost of describing the data given the model. A complex model fits the data well (low L(D|M)), but costs a lot to describe (high L(M)). A simple model is cheap to describe but may not fit well. The optimum balances both.
This is not just a metaphor. Grünwald shows that the connection between compression and learning can be made mathematically precise through coding theory. Every probability distribution defines a code. Every code defines a probability distribution. The length of the shortest code for data D under model M is exactly −log P(D|M). We will derive this correspondence in Chapter 5.
If compression measures learning, then the ideal measure of complexity would be the length of the shortest possible description of the data. This ideal exists. It is called Kolmogorov complexity.
Where |p| is the length of program p, and U is a universal Turing machine. In words: K(D) is the length of the shortest computer program that outputs the data D and then halts.
This is beautiful in theory. The string 0101...01 (1000 repetitions) has low Kolmogorov complexity — a short loop generates it. A truly random string of the same length has high Kolmogorov complexity — no program shorter than the string itself can generate it.
There is one fatal problem: Kolmogorov complexity is uncomputable. There is no algorithm that takes a string and returns its Kolmogorov complexity. This follows from the halting problem — you would need to enumerate all programs and check which ones halt and produce D, which is undecidable.
Moreover, K(D) depends on the choice of universal Turing machine U. Different machines give different values, though they differ by at most a constant (the invariance theorem):
This constant c does not depend on D, so for long data sequences the choice of U becomes irrelevant. But for finite data of practical size, the constant matters and cannot be computed.
| Property | Kolmogorov Complexity |
|---|---|
| What it measures | Absolute shortest description of data |
| Advantages | Ultimate, model-free complexity measure |
| Disadvantage | Uncomputable (provably) |
| Invariance | Up to a constant, independent of programming language |
| Role in MDL | Theoretical ideal that motivates practical MDL |
So Kolmogorov complexity tells us what the perfect answer looks like, but we can never actually compute it. This forces us to restrict our codes to something practical. Instead of "all programs," we consider specific model classes with well-defined codes. This is practical MDL.
Here is practical MDL in its simplest form. To describe the data D, we send two messages:
Think of it as a sender-receiver problem. Alice has data D and wants to send it to Bob using as few bits as possible. She can first send a model (a compression scheme), and then send the data compressed using that model. Bob receives both parts and reconstructs D perfectly.
Concrete example. Alice has the sequence: 7, 14, 21, 28, 35, 42, 49, 56.
Strategy A (no model): Just send all 8 numbers. Each needs, say, 6 bits. Total: 48 bits.
Strategy B (linear model): Send the model "y = 7x" (the rule), which costs maybe 10 bits. Then send the residuals (all zero), which costs 0 bits. Total: 10 bits.
Strategy C (quadratic model): Send "y = ax² + bx + c" with fitted coefficients (more bits for the model), plus residuals. The quadratic fits perfectly too, but the model description is longer. Total: maybe 18 bits.
Strategy B wins. The linear model captures all the regularity, so the residuals are zero. The quadratic wastes bits on unnecessary model complexity.
For model selection among a countable set of models H1, H2, ..., we need a code for models. A natural choice: assign shorter codes to simpler models. If we have k models, we could use a uniform code (log k bits each), but Grünwald recommends codes that give shorter descriptions to simpler models — embodying our prior preference for simplicity.
For the data given the model, we use the model's own probability distribution. If model H assigns probability P(D|H) to the data, the code length is:
This is the Shannon-Fano code. A high-probability event gets a short code. A low-probability event gets a long code. We will derive why −log P is the right code length in Chapter 5.
The simplest version of practical MDL is what Grünwald calls crude MDL (also known as "two-part code MDL"). It works like this:
Given models M1, M2, ..., MK, each a parametric family, find the model Mk that minimizes:
Where θ̂k is the maximum likelihood estimate within model Mk, and L(Mk) is the cost of describing the model.
The first term, −log P(D|θ̂k), is the negative log-likelihood at the best-fit parameters. This measures lack-of-fit. Complex models drive this term down because they can fit the data closely.
The second term, L(Mk), is the model cost. This penalizes complexity. For a model with d parameters, each specified to some precision, this term grows with d.
Here is a concrete worked example. Suppose we observe n = 100 data points and consider polynomial regression of degree d. The model has d+1 parameters (coefficients). Under crude MDL with the BIC-style code:
| Degree d | −log P(D|θ̂) | L(M) = (d+1)/2 · log n | Total |
|---|---|---|---|
| 1 | 150.0 | 1.0 × 3.32 = 3.3 | 153.3 |
| 2 | 110.0 | 1.5 × 3.32 = 5.0 | 115.0 ← winner |
| 3 | 108.5 | 2.0 × 3.32 = 6.6 | 115.1 |
| 5 | 107.0 | 3.0 × 3.32 = 10.0 | 117.0 |
| 10 | 106.2 | 5.5 × 3.32 = 18.3 | 124.5 |
Degree 2 wins. Beyond degree 2, each additional parameter reduces the log-likelihood by less than it costs to describe. The regularity has been captured; additional complexity is just fitting noise.
Despite this weakness, crude MDL is widely used and often works well in practice. BIC, one of the most popular model selection criteria in statistics, is essentially crude MDL. But when model selection matters and you need a principled answer, you need refined MDL.
We have been talking about "code lengths" and claiming that −log P is the right code length for an event with probability P. It is time to make this precise. This requires a brief detour into information theory — specifically, prefix codes.
A code maps each possible data value to a binary string (a codeword). For example, for the alphabet {a, b, c, d}:
| Symbol | Codeword | Length |
|---|---|---|
| a | 0 | 1 |
| b | 10 | 2 |
| c | 110 | 3 |
| d | 111 | 3 |
This is a prefix code (also called a prefix-free code): no codeword is a prefix of any other. This matters because it means we can decode a stream of bits unambiguously without needing separators. When we see 010110111, we can parse it uniquely as 0|10|110|111 = a, b, c, d.
Not every set of code lengths is achievable. The Kraft inequality tells us exactly which lengths are possible:
A set of code lengths L(x) can be realized as a prefix code if and only if they satisfy this inequality. This is a fundamental constraint: you cannot make all codewords short. If some codewords are short (frequent symbols), others must be long (rare symbols). There is a conservation law at work.
This duality is the mathematical engine of MDL. When we say "the code length of data D under model M is −log P(D|M)," we are not making an analogy. We are stating a precise mathematical fact: the distribution P(D|M) literally defines a prefix code, and the length of D under that code is −log P(D|M).
Shannon's source coding theorem tells us the optimal code lengths. If data is drawn from distribution P, the expected code length is minimized by the code with lengths L(x) = −log P(x). The minimum expected code length is the Shannon entropy:
No code can achieve shorter expected code length than H(P). This is why entropy is the fundamental limit of compression, and why −log P is the natural code length function.
Here is the key problem with crude MDL: it uses a two-part code where Part 1 encodes the model and Part 2 encodes the data using the best-fit parameters of that model. But the "best-fit parameters" are computed from the data. So Part 2 is not really a valid code — the decoder would need to know the data to know which parameters were used!
Crude MDL works around this by encoding the parameters explicitly in Part 1. But this introduces the arbitrary precision problem. Refined MDL takes a completely different approach: use a universal code that works well for any parameter value, without needing to specify which one.
A universal code relative to a model M = {Pθ : θ ∈ Θ} is a code with lengths L̄(D) such that for every θ, the code performs nearly as well as the optimal code for Pθ:
where Rn(θ) is a regret term that grows slowly with n (ideally sub-linearly). In words: the universal code's length is at most Rn bits longer than the ideal code for any specific θ in the model. The code adapts to whichever distribution generated the data, with only a small overhead.
There are several ways to construct universal codes. Two important ones:
1. Bayesian mixture code. Put a prior w(θ) over the parameters and use the marginal likelihood:
The code length is −log P̄Bayes(D). This is exactly the Bayesian marginal likelihood. It is a valid prefix code, and it achieves regret (d/2) log n + O(1) for d-dimensional parametric models.
2. Normalized Maximum Likelihood (NML). This is the code that Grünwald argues is the most principled. We define it in the next chapter.
This is the centerpiece of Grünwald's paper. Refined MDL replaces the crude two-part code with a single universal code based on the Normalized Maximum Likelihood (NML) distribution. Here is how it works.
Given a model M = {Pθ : θ ∈ Θ}, the NML distribution is:
Where θ̂(D) is the maximum likelihood estimate for data D, and the sum in the denominator runs over all possible data sequences D' of the same length n.
Let's parse this carefully. The numerator is just the maximized likelihood — the probability of the data under the best-fitting parameter. The denominator is a normalizing constant that ensures ∑D P̄NML(D|M) = 1, making this a proper probability distribution. This normalizing constant is called the parametric complexity:
The stochastic complexity of data D relative to model M is the NML code length:
Why is the parametric complexity the right complexity measure? Because it counts, in a precise information-theoretic sense, how many different data patterns the model can "explain well." A model that can fit anything explains nothing. The more patterns a model can fit, the higher COMPn(M), and the more it is penalized.
For regular parametric models with d parameters and n data points, the parametric complexity has an elegant asymptotic form:
Where I(θ) is the Fisher information matrix. The dominant term (d/2) log n matches BIC. But there is a correction term involving the Fisher information — a geometric property of the model that measures how "spread out" the model's distributions are in parameter space. This correction can matter in practice.
Generate noisy data from a true polynomial, then compare crude MDL (BIC-style) and refined MDL (NML-based) model selection. The minimum description length picks the model. Toggle between the two criteria.
MDL is not the only model selection criterion. How does it compare to the alternatives? Grünwald carefully distinguishes MDL from several related approaches.
MDL vs Maximum Likelihood (ML): ML picks the model that maximizes P(D|θ̂). It has no complexity penalty, so it always prefers the most complex model. MDL adds a complexity penalty that prevents overfitting. ML is a special case of MDL where you ignore the model description cost.
MDL vs AIC (Akaike Information Criterion):
AIC penalizes each parameter by exactly 1 nat (about 1.44 bits). This penalty does not grow with sample size n. As a result, AIC is not consistent: even with infinite data, it may select an overly complex model. MDL's penalty grows as (d/2) log n, so it is consistent — with enough data, it picks the true model.
MDL vs BIC (Bayesian Information Criterion):
BIC is numerically close to crude MDL and shares its (d/2) log n penalty. But BIC is derived from a Bayesian argument (Laplace approximation to the marginal likelihood), while MDL is derived from coding theory. They often agree, but refined MDL's COMPn includes the Fisher information correction that BIC lacks, and this can matter for small samples or models with non-uniform Fisher information.
MDL vs Bayesian Model Selection:
Bayesian model selection computes the posterior P(M|D) ∝ P(D|M) P(M), where P(D|M) = ∫ P(D|θ) π(θ) dθ is the marginal likelihood. This is closely related to the Bayesian mixture universal code. The key philosophical difference: Bayesian methods assume one of the models is "true." MDL does not. MDL works even when all models are wrong — it still picks the one that compresses best.
| Criterion | Penalty | Consistent? | Requires prior? |
|---|---|---|---|
| ML | None | No (overfits) | No |
| AIC | d | No | No |
| BIC | (d/2) log n | Yes | No |
| Crude MDL | ≈ (d/2) log n | Yes | No (but needs model code) |
| Refined MDL (NML) | COMPn(M) | Yes | No |
| Bayesian | ≈ (d/2) log n + ... | Yes | Yes (prior π(θ)) |
MDL vs Cross-Validation: Cross-validation estimates predictive performance by holding out data. It is model-free and makes no distributional assumptions. MDL and cross-validation often agree, but MDL is more efficient (uses all data for fitting) and has a cleaner theoretical foundation. Cross-validation can be inconsistent in some settings where MDL is consistent.
MDL sits at a crossroads of several deep ideas in mathematics, computer science, and philosophy. Let's trace the connections.
MDL and Occam's Razor. Occam's razor says: prefer the simpler explanation. But it is often misunderstood. MDL does not say "always pick the simplest model." It says: pick the model that achieves the shortest total description length. Sometimes the more complex model wins, if the data has enough structure to justify the extra complexity. MDL is a quantitative Occam's razor that tells you exactly when complexity pays.
MDL and Kolmogorov Complexity. Ideal MDL (using Kolmogorov complexity) is the theoretical gold standard. Practical MDL approximates it by restricting to computable model classes. As model classes grow richer, practical MDL approaches ideal MDL. The theory of Kolmogorov complexity provides the philosophical foundation; practical MDL provides the usable algorithms.
MDL and PAC Learning. In computational learning theory, PAC (Probably Approximately Correct) bounds also favor simpler hypotheses. The Occam bound says: if a simple hypothesis fits the training data, it will generalize. MDL's two-part code criterion is closely related — a short description length implies good generalization, similar to how a small hypothesis class implies a tight PAC bound.
MDL and Prequential (Predictive) Coding. Instead of two-part codes, one can use prequential codes: predict each data point using only the previous data points. The total code length is ∑i −log P(xi | x1, ..., xi-1). This is equivalent to using a Bayesian mixture code and connects MDL to online learning and sequential prediction.
MDL and Neural Networks. Modern deep learning implicitly uses MDL-like ideas. Weight pruning, quantization, and knowledge distillation all compress the model. The lottery ticket hypothesis says that large networks contain small subnetworks that perform equally well — the description length of the subnetwork is much shorter. Kolmogorov complexity of trained weights has been proposed as a measure of what neural networks have learned.
Paper impact. This 2004 tutorial became one of the most-cited introductions to MDL. It influenced work in Bayesian nonparametrics, neural network compression, and automated machine learning (AutoML). The MDL principle remains relevant today: every time you prune a model, distill knowledge, or penalize complexity, you are, in some sense, doing MDL.