MDL Principle — Veanors

Chapter 0: The Problem

You have 20 data points from some unknown process. You fit a degree-1 polynomial (a line). It misses some points. You fit a degree-5 polynomial. It passes closer to the data. You fit a degree-19 polynomial. It passes through every single point perfectly. Zero error.

Which model should you trust?

Every working scientist knows the answer intuitively: the degree-19 polynomial is nonsense. It memorized the data rather than learning the pattern. It will make terrible predictions on new data. But why? The degree-19 model has the lowest error. By any "fit the data" criterion, it wins. What are we really doing when we reject it?

The central question: Model selection is not about finding the model that fits best. It is about finding the model that captures the regularity in the data while ignoring the noise. But how do you tell regularity from noise? MDL gives a precise, principled answer: regularity is whatever lets you compress the data.

This is Occam's razor made operational. We all agree that simpler explanations are preferable. But "simpler" is vague. Simpler in what sense? Fewer parameters? Smoother curves? Shorter equations? MDL replaces this vague notion with a concrete, measurable one: the best model is the one that allows the shortest description of the data.

Here is the intuition. If you have found genuine regularity in the data, you can exploit that regularity to describe the data more concisely than just listing every data point. A good model is a good compressor. The best model is the best compressor.

Overfitting: Why Perfect Fit Is Not Perfect

Click Generate Data for noisy samples from a true degree-2 curve. Then toggle polynomial degree and watch the fit and description length. The minimum description length picks the right model.

Degree: 2

Polynomial degree2

In the simulation above, notice the pattern: as the degree increases from 1, the description length first decreases (the model captures real structure, so it compresses better), then increases (extra parameters cost bits to describe, but the fit gains are just memorizing noise). The minimum is at or near the true degree.

Why does a degree-19 polynomial that passes through every point have a LONGER description length than a degree-2 polynomial that misses some points?

Because the degree-19 polynomial is slower to compute Because the 19 coefficients cost many bits to describe, and this cost outweighs the small savings from a perfect fit Because higher-degree polynomials always have longer descriptions

Chapter 1: Compression = Learning

Here is the fundamental insight that MDL is built on, stated plainly:

The MDL Insight: Any regularity in the data can be used to compress the data. The more regularities you find, the more you can compress. Therefore, learning (finding regularity) and compression (shortening descriptions) are the same thing.

Consider a binary sequence: 0101010101010101. That is 16 bits. But you can describe it much more concisely: "repeat 01 eight times." That short description works because there is a regularity — the alternating pattern. If you can describe the data in fewer bits than the raw data takes, you have found structure.

Now consider: 1001110100010110. Can you compress this? Not really. There is no obvious pattern to exploit. The shortest description is essentially the sequence itself. MDL would say: there is no learnable regularity here. The data is incompressible.

This maps directly to model selection. Suppose you have data D and two candidate models:

	Model M₁ (simple)	Model M₂ (complex)
Cost to describe model	Short (few parameters)	Long (many parameters)
Cost to describe data given model	Moderate (imperfect fit)	Short (excellent fit)
Total description length	L(M₁) + L(D\|M₁)	L(M₂) + L(D\|M₂)

MDL picks the model that minimizes the total description length: the cost of describing the model itself, plus the cost of describing the data given the model. A complex model fits the data well (low L(D|M)), but costs a lot to describe (high L(M)). A simple model is cheap to describe but may not fit well. The optimum balances both.

Regularity in Data

"There is a pattern: y ≈ 2x + 3"

↓ enables

Compression

"Describe model (2x+3) + residuals, instead of listing all y values"

↓ implies

Learning

"The shorter the total description, the more regularity we found"

This is not just a metaphor. Grünwald shows that the connection between compression and learning can be made mathematically precise through coding theory. Every probability distribution defines a code. Every code defines a probability distribution. The length of the shortest code for data D under model M is exactly −log P(D|M). We will derive this correspondence in Chapter 5.

Why compression, not likelihood? Maximum likelihood says: pick the model that makes the data most probable. But maximum likelihood always picks the most complex model — it never penalizes complexity. MDL says: pick the model that describes the data most concisely. This automatically penalizes complexity because complex models are expensive to describe. Compression provides a built-in Occam's razor.

According to MDL, what does it mean to say "we have learned something about the data"?

We have achieved a low training error We have memorized all data points We have found a regularity that lets us compress the data shorter than its literal description

Chapter 2: Kolmogorov Complexity

If compression measures learning, then the ideal measure of complexity would be the length of the shortest possible description of the data. This ideal exists. It is called Kolmogorov complexity.

K(D) = min { |p| : U(p) = D }

Where |p| is the length of program p, and U is a universal Turing machine. In words: K(D) is the length of the shortest computer program that outputs the data D and then halts.

This is beautiful in theory. The string 0101...01 (1000 repetitions) has low Kolmogorov complexity — a short loop generates it. A truly random string of the same length has high Kolmogorov complexity — no program shorter than the string itself can generate it.

Ideal MDL: Given data D and models M₁, M₂, the ideal MDL approach would compute K(D|M₁) vs K(D|M₂) — the length of the shortest program that generates D, given the model as input. The model that allows the shorter program is the better explanation.

There is one fatal problem: Kolmogorov complexity is uncomputable. There is no algorithm that takes a string and returns its Kolmogorov complexity. This follows from the halting problem — you would need to enumerate all programs and check which ones halt and produce D, which is undecidable.

Moreover, K(D) depends on the choice of universal Turing machine U. Different machines give different values, though they differ by at most a constant (the invariance theorem):

|K_U1(D) − K_U2(D)| ≤ c_U1,U2

This constant c does not depend on D, so for long data sequences the choice of U becomes irrelevant. But for finite data of practical size, the constant matters and cannot be computed.

Property	Kolmogorov Complexity
What it measures	Absolute shortest description of data
Advantages	Ultimate, model-free complexity measure
Disadvantage	Uncomputable (provably)
Invariance	Up to a constant, independent of programming language
Role in MDL	Theoretical ideal that motivates practical MDL

So Kolmogorov complexity tells us what the perfect answer looks like, but we can never actually compute it. This forces us to restrict our codes to something practical. Instead of "all programs," we consider specific model classes with well-defined codes. This is practical MDL.

The pragmatic turn: Ideal MDL says "find the shortest program." Practical MDL says "I can't search all programs, so let me search within a well-defined family of models, using codes I can actually compute." The rest of this paper is about making that practical approach precise and powerful.

Why can't we just use Kolmogorov complexity directly for model selection?

It is uncomputable — no algorithm can determine the shortest program for arbitrary data It is too slow (exponential time) It only works for binary data

Chapter 3: Two-Part Codes

Here is practical MDL in its simplest form. To describe the data D, we send two messages:

Part 1: The Model

Encode which model (hypothesis) we picked: L(H)

Part 2: The Data Given the Model

Encode the data using the model: L(D|H)

Total Description Length

L(H) + L(D|H) — minimize this!

Think of it as a sender-receiver problem. Alice has data D and wants to send it to Bob using as few bits as possible. She can first send a model (a compression scheme), and then send the data compressed using that model. Bob receives both parts and reconstructs D perfectly.

Concrete example. Alice has the sequence: 7, 14, 21, 28, 35, 42, 49, 56.

Strategy A (no model): Just send all 8 numbers. Each needs, say, 6 bits. Total: 48 bits.

Strategy B (linear model): Send the model "y = 7x" (the rule), which costs maybe 10 bits. Then send the residuals (all zero), which costs 0 bits. Total: 10 bits.

Strategy C (quadratic model): Send "y = ax² + bx + c" with fitted coefficients (more bits for the model), plus residuals. The quadratic fits perfectly too, but the model description is longer. Total: maybe 18 bits.

Strategy B wins. The linear model captures all the regularity, so the residuals are zero. The quadratic wastes bits on unnecessary model complexity.

The tradeoff is automatic: You never have to manually decide "how complex is too complex." The code itself decides. If a more complex model saves more bits on data encoding (Part 2) than it costs in model encoding (Part 1), it wins. If not, it loses. The math does the work.

For model selection among a countable set of models H₁, H₂, ..., we need a code for models. A natural choice: assign shorter codes to simpler models. If we have k models, we could use a uniform code (log k bits each), but Grünwald recommends codes that give shorter descriptions to simpler models — embodying our prior preference for simplicity.

For the data given the model, we use the model's own probability distribution. If model H assigns probability P(D|H) to the data, the code length is:

L(D|H) = −log₂ P(D|H)

This is the Shannon-Fano code. A high-probability event gets a short code. A low-probability event gets a long code. We will derive why −log P is the right code length in Chapter 5.

In a two-part code, what are the two parts?

(1) A description of the model/hypothesis, and (2) a description of the data given that model (1) The training data and (2) the test data (1) The parameters and (2) the hyperparameters

Chapter 4: Crude MDL

The simplest version of practical MDL is what Grünwald calls crude MDL (also known as "two-part code MDL"). It works like this:

Given models M₁, M₂, ..., M_K, each a parametric family, find the model M_k that minimizes:

L_crude(M_k) = −log P(D | θ̂_k) + L(M_k)

Where θ̂_k is the maximum likelihood estimate within model M_k, and L(M_k) is the cost of describing the model.

The first term, −log P(D|θ̂_k), is the negative log-likelihood at the best-fit parameters. This measures lack-of-fit. Complex models drive this term down because they can fit the data closely.

The second term, L(M_k), is the model cost. This penalizes complexity. For a model with d parameters, each specified to some precision, this term grows with d.

Looks familiar? This is closely related to criteria you may already know. BIC (Bayesian Information Criterion) is approximately crude MDL with a specific choice of model code: L(M_k) = (d_k/2) log n, where d_k is the number of parameters and n is the sample size. AIC uses L(M_k) = d_k. Different model codes give different criteria.

Here is a concrete worked example. Suppose we observe n = 100 data points and consider polynomial regression of degree d. The model has d+1 parameters (coefficients). Under crude MDL with the BIC-style code:

Degree d	−log P(D\|θ̂)	L(M) = (d+1)/2 · log n	Total
1	150.0	1.0 × 3.32 = 3.3	153.3
2	110.0	1.5 × 3.32 = 5.0	115.0 ← winner
3	108.5	2.0 × 3.32 = 6.6	115.1
5	107.0	3.0 × 3.32 = 10.0	117.0
10	106.2	5.5 × 3.32 = 18.3	124.5

Degree 2 wins. Beyond degree 2, each additional parameter reduces the log-likelihood by less than it costs to describe. The regularity has been captured; additional complexity is just fitting noise.

The problem with crude MDL: This approach has a serious weakness. The description length depends on how you encode the parameters — to what precision, in what order, using what code. Different encoding choices give different answers. This is not a minor technical detail; it means crude MDL is not a principled, unique criterion. Grünwald calls this the "arbitrary precision" problem. Refined MDL (Chapter 7) eliminates it entirely.

Despite this weakness, crude MDL is widely used and often works well in practice. BIC, one of the most popular model selection criteria in statistics, is essentially crude MDL. But when model selection matters and you need a principled answer, you need refined MDL.

What is the main weakness of crude (two-part code) MDL?

It is computationally intractable It only works for linear models The result depends on the arbitrary choice of how to encode the parameters (precision, code format)

Chapter 5: Prefix Codes & the Kraft Inequality

We have been talking about "code lengths" and claiming that −log P is the right code length for an event with probability P. It is time to make this precise. This requires a brief detour into information theory — specifically, prefix codes.

A code maps each possible data value to a binary string (a codeword). For example, for the alphabet {a, b, c, d}:

Symbol	Codeword	Length
a	0	1
b	10	2
c	110	3
d	111	3

This is a prefix code (also called a prefix-free code): no codeword is a prefix of any other. This matters because it means we can decode a stream of bits unambiguously without needing separators. When we see 010110111, we can parse it uniquely as 0|10|110|111 = a, b, c, d.

Not every set of code lengths is achievable. The Kraft inequality tells us exactly which lengths are possible:

∑_{x ∈ X} 2^−L(x) ≤ 1

A set of code lengths L(x) can be realized as a prefix code if and only if they satisfy this inequality. This is a fundamental constraint: you cannot make all codewords short. If some codewords are short (frequent symbols), others must be long (rare symbols). There is a conservation law at work.

The probability-code duality: Notice that ∑ 2^−L(x) ≤ 1 looks exactly like ∑ P(x) ≤ 1. This is not a coincidence. Every prefix code defines a (possibly defective) probability distribution via P(x) = 2^−L(x). Conversely, every probability distribution P defines a prefix code with lengths L(x) = −log₂ P(x) (rounding up to the nearest integer). Codes and distributions are two views of the same object.

This duality is the mathematical engine of MDL. When we say "the code length of data D under model M is −log P(D|M)," we are not making an analogy. We are stating a precise mathematical fact: the distribution P(D|M) literally defines a prefix code, and the length of D under that code is −log P(D|M).

Shannon's source coding theorem tells us the optimal code lengths. If data is drawn from distribution P, the expected code length is minimized by the code with lengths L(x) = −log P(x). The minimum expected code length is the Shannon entropy:

H(P) = −∑_x P(x) log P(x) = E[−log P(X)]

No code can achieve shorter expected code length than H(P). This is why entropy is the fundamental limit of compression, and why −log P is the natural code length function.

Bits and nats: When we use log base 2, lengths are in bits. When we use natural log, lengths are in nats. MDL uses log base 2 (bits) for concrete coding interpretations, but the choice is just a scaling factor. Throughout this lesson, "log" means log₂ unless stated otherwise.

What does the Kraft inequality ∑ 2^−L(x) ≤ 1 tell us?

It constrains which sets of code lengths can be realized as a prefix code — you can't make all codewords short It tells us the maximum number of symbols we can encode It measures the entropy of a distribution

Chapter 6: Universal Codes

Here is the key problem with crude MDL: it uses a two-part code where Part 1 encodes the model and Part 2 encodes the data using the best-fit parameters of that model. But the "best-fit parameters" are computed from the data. So Part 2 is not really a valid code — the decoder would need to know the data to know which parameters were used!

Crude MDL works around this by encoding the parameters explicitly in Part 1. But this introduces the arbitrary precision problem. Refined MDL takes a completely different approach: use a universal code that works well for any parameter value, without needing to specify which one.

The breakthrough: Instead of encoding "the model" and "the data given the model" separately, use a single code for the data that works well no matter which distribution in the model generated it. This eliminates the two-part split and the arbitrary precision problem in one stroke.

A universal code relative to a model M = {P_θ : θ ∈ Θ} is a code with lengths L̄(D) such that for every θ, the code performs nearly as well as the optimal code for P_θ:

L̄(D) − (−log P_θ(D)) ≤ R_n(θ)

where R_n(θ) is a regret term that grows slowly with n (ideally sub-linearly). In words: the universal code's length is at most R_n bits longer than the ideal code for any specific θ in the model. The code adapts to whichever distribution generated the data, with only a small overhead.

There are several ways to construct universal codes. Two important ones:

1. Bayesian mixture code. Put a prior w(θ) over the parameters and use the marginal likelihood:

P̄_Bayes(D) = ∫ P_θ(D) w(θ) dθ

The code length is −log P̄_Bayes(D). This is exactly the Bayesian marginal likelihood. It is a valid prefix code, and it achieves regret (d/2) log n + O(1) for d-dimensional parametric models.

2. Normalized Maximum Likelihood (NML). This is the code that Grünwald argues is the most principled. We define it in the next chapter.

Why "universal"? The code is called universal because it works well regardless of which θ generated the data. You don't need to commit to a specific parameter value. The code automatically adapts. This is the key advantage over two-part codes: no arbitrary encoding decisions.

What is the advantage of a universal code over a two-part code?

It is always shorter It eliminates the arbitrary precision problem by not requiring explicit parameter encoding It does not require any model at all

Chapter 7: Refined MDL — The NML Code

This is the centerpiece of Grünwald's paper. Refined MDL replaces the crude two-part code with a single universal code based on the Normalized Maximum Likelihood (NML) distribution. Here is how it works.

Given a model M = {P_θ : θ ∈ Θ}, the NML distribution is:

P̄_NML(D | M) = P(D | θ̂(D)) / ∑_D' P(D' | θ̂(D'))

Where θ̂(D) is the maximum likelihood estimate for data D, and the sum in the denominator runs over all possible data sequences D' of the same length n.

Let's parse this carefully. The numerator is just the maximized likelihood — the probability of the data under the best-fitting parameter. The denominator is a normalizing constant that ensures ∑_D P̄_NML(D|M) = 1, making this a proper probability distribution. This normalizing constant is called the parametric complexity:

COMP_n(M) = log ∑_{D' ∈ Xⁿ} P(D' | θ̂(D'))

The stochastic complexity of data D relative to model M is the NML code length:

−log P̄_NML(D|M) = −log P(D|θ̂(D)) + COMP_n(M)

The refined MDL criterion: Select the model M that minimizes the stochastic complexity: −log P(D|θ̂) + COMP_n(M). The first term is lack-of-fit (same as crude MDL). The second term, COMP_n(M), is the parametric complexity — a measure of how many distinct data patterns the model can fit. This is the principled complexity penalty.

Why is the parametric complexity the right complexity measure? Because it counts, in a precise information-theoretic sense, how many different data patterns the model can "explain well." A model that can fit anything explains nothing. The more patterns a model can fit, the higher COMP_n(M), and the more it is penalized.

For regular parametric models with d parameters and n data points, the parametric complexity has an elegant asymptotic form:

COMP_n(M) ≈ (d/2) log(n / 2π) + log ∫ √det I(θ) dθ + o(1)

Where I(θ) is the Fisher information matrix. The dominant term (d/2) log n matches BIC. But there is a correction term involving the Fisher information — a geometric property of the model that measures how "spread out" the model's distributions are in parameter space. This correction can matter in practice.

Stochastic Complexity: Crude vs Refined MDL

Generate noisy data from a true polynomial, then compare crude MDL (BIC-style) and refined MDL (NML-based) model selection. The minimum description length picks the model. Toggle between the two criteria.

n=30, true degree=3

True degree3

Sample size n30

Noise level σ0.5

NML is minimax optimal. Among all universal codes for model M, the NML code achieves the smallest worst-case regret. No other code does better in the worst case. This is why Grünwald considers it the most principled universal code — it makes no arbitrary choices and is uniquely determined by the model itself.

In refined MDL, what does COMP_n(M) — the parametric complexity — measure?

The number of parameters in the model The computational cost of fitting the model How many distinct data patterns the model can fit well — its effective richness

Chapter 8: MDL vs the World

MDL is not the only model selection criterion. How does it compare to the alternatives? Grünwald carefully distinguishes MDL from several related approaches.

MDL vs Maximum Likelihood (ML): ML picks the model that maximizes P(D|θ̂). It has no complexity penalty, so it always prefers the most complex model. MDL adds a complexity penalty that prevents overfitting. ML is a special case of MDL where you ignore the model description cost.

MDL vs AIC (Akaike Information Criterion):

AIC = −log P(D|θ̂) + d

AIC penalizes each parameter by exactly 1 nat (about 1.44 bits). This penalty does not grow with sample size n. As a result, AIC is not consistent: even with infinite data, it may select an overly complex model. MDL's penalty grows as (d/2) log n, so it is consistent — with enough data, it picks the true model.

MDL vs BIC (Bayesian Information Criterion):

BIC = −log P(D|θ̂) + (d/2) log n

BIC is numerically close to crude MDL and shares its (d/2) log n penalty. But BIC is derived from a Bayesian argument (Laplace approximation to the marginal likelihood), while MDL is derived from coding theory. They often agree, but refined MDL's COMP_n includes the Fisher information correction that BIC lacks, and this can matter for small samples or models with non-uniform Fisher information.

MDL vs Bayesian Model Selection:

Bayesian model selection computes the posterior P(M|D) ∝ P(D|M) P(M), where P(D|M) = ∫ P(D|θ) π(θ) dθ is the marginal likelihood. This is closely related to the Bayesian mixture universal code. The key philosophical difference: Bayesian methods assume one of the models is "true." MDL does not. MDL works even when all models are wrong — it still picks the one that compresses best.

Criterion	Penalty	Consistent?	Requires prior?
ML	None	No (overfits)	No
AIC	d	No	No
BIC	(d/2) log n	Yes	No
Crude MDL	≈ (d/2) log n	Yes	No (but needs model code)
Refined MDL (NML)	COMP_n(M)	Yes	No
Bayesian	≈ (d/2) log n + ...	Yes	Yes (prior π(θ))

MDL's philosophical advantage: Unlike Bayesian inference, MDL does not assume any model is true. It does not ask "which model generated the data?" but rather "which model compresses the data best?" This is a weaker, more honest claim. In practice, all models are wrong. MDL acknowledges this and still gives useful answers.

MDL vs Cross-Validation: Cross-validation estimates predictive performance by holding out data. It is model-free and makes no distributional assumptions. MDL and cross-validation often agree, but MDL is more efficient (uses all data for fitting) and has a cleaner theoretical foundation. Cross-validation can be inconsistent in some settings where MDL is consistent.

When to use what: If you believe your model class contains the truth and you have a meaningful prior, use Bayesian inference. If you want prediction accuracy and don't care about model identity, use cross-validation. If you want a principled, assumption-lean model selection criterion that works even when all models are wrong, use MDL.

What is the key philosophical difference between MDL and Bayesian model selection?

MDL does not assume any model is true; it asks which model compresses best, even if all models are wrong MDL requires a prior distribution over models MDL only works for linear models

Chapter 9: Connections

MDL sits at a crossroads of several deep ideas in mathematics, computer science, and philosophy. Let's trace the connections.

MDL and Occam's Razor. Occam's razor says: prefer the simpler explanation. But it is often misunderstood. MDL does not say "always pick the simplest model." It says: pick the model that achieves the shortest total description length. Sometimes the more complex model wins, if the data has enough structure to justify the extra complexity. MDL is a quantitative Occam's razor that tells you exactly when complexity pays.

MDL and Kolmogorov Complexity. Ideal MDL (using Kolmogorov complexity) is the theoretical gold standard. Practical MDL approximates it by restricting to computable model classes. As model classes grow richer, practical MDL approaches ideal MDL. The theory of Kolmogorov complexity provides the philosophical foundation; practical MDL provides the usable algorithms.

MDL and PAC Learning. In computational learning theory, PAC (Probably Approximately Correct) bounds also favor simpler hypotheses. The Occam bound says: if a simple hypothesis fits the training data, it will generalize. MDL's two-part code criterion is closely related — a short description length implies good generalization, similar to how a small hypothesis class implies a tight PAC bound.

MDL and Prequential (Predictive) Coding. Instead of two-part codes, one can use prequential codes: predict each data point using only the previous data points. The total code length is ∑_i −log P(x_i | x₁, ..., x_i-1). This is equivalent to using a Bayesian mixture code and connects MDL to online learning and sequential prediction.

MDL and Neural Networks. Modern deep learning implicitly uses MDL-like ideas. Weight pruning, quantization, and knowledge distillation all compress the model. The lottery ticket hypothesis says that large networks contain small subnetworks that perform equally well — the description length of the subnetwork is much shorter. Kolmogorov complexity of trained weights has been proposed as a measure of what neural networks have learned.

Kolmogorov Complexity

The ideal, uncomputable foundation

↓ restrict to computable models

Practical MDL

Two-part codes, universal codes, NML

↔ closely related

Bayesian Inference

Marginal likelihood ≈ Bayesian mixture code length

The big picture. MDL provides a unifying framework: model selection, Occam's razor, Bayesian inference, PAC learning, and data compression are all facets of the same underlying principle — the best explanation of data is the one that compresses it most. Grünwald's paper puts this ancient intuition on rigorous mathematical footing, giving us NML as the uniquely principled instantiation.

Paper impact. This 2004 tutorial became one of the most-cited introductions to MDL. It influenced work in Bayesian nonparametrics, neural network compression, and automated machine learning (AutoML). The MDL principle remains relevant today: every time you prune a model, distill knowledge, or penalize complexity, you are, in some sense, doing MDL.

← Back to Veanors Hub

How does MDL's version of Occam's razor differ from the naive "always pick the simplest model" interpretation?

MDL ignores model complexity entirely MDL picks the model that minimizes total description length, which sometimes favors a more complex model if the data has enough structure to justify it MDL always picks the model with the fewest parameters

The MinimumDescription LengthPrinciple