The Coffee Automaton

Chapter 0: The Puzzle

Imagine you pour cream into a cup of black coffee. At the very start, the system is simple to describe: "cream on top, coffee on bottom." After a long time, it is equally simple: "uniform tan mixture everywhere." But in between — when tendrils of cream are swirling through dark coffee, when there are patches and whorls and filaments — the system looks complex.

Here is what is strange. The second law of thermodynamics says entropy — a measure of disorder — increases monotonically in a closed system. It starts low and ends high. Period. No exceptions. But complexity does something different. It starts low, rises, and then falls back to low. Entropy and complexity are not the same thing, and they follow different trajectories.

The big question: Can we actually measure this rise and fall? Is there a formal definition of "complexity" or "interestingness" that captures what our eyes see — that a half-mixed coffee cup is more interesting than a fully separated or fully mixed one? And if so, can we prove mathematically that complexity must rise before falling?

This is not just a question about coffee. The universe itself follows the same pattern. Near the Big Bang: smooth, hot plasma. Low entropy, low complexity. In 10¹⁰⁰ years: dispersed particles, heat death. High entropy, low complexity. But right now, between those endpoints, we have galaxies, stars, planets, and life. Maximum complexity at an intermediate time.

Physicists have talked about this informally for decades. Sean Carroll discussed it in From Eternity to Here. Murray Gell-Mann discussed it in The Quark and the Jaguar. But nobody had actually measured it. Nobody had built a toy system, defined a formal complexity measure, run the simulation, and plotted the curve.

Aaronson, Carroll, and Ouellette set out to do exactly that. They build the simplest possible model — a 2D cellular automaton of cream and coffee — and test whether they can see complexity rise and then fall as entropy increases. They can. And the result is surprisingly clean.

Along the way, they survey four different definitions of "complexity" from the algorithmic information theory and complex systems literature, show how they relate to each other, and identify which one is actually computable for their model system. The paper is part physics, part computer science, and part philosophy of what "interestingness" means.

What pattern does "complexity" follow in a closed system, according to this paper's hypothesis?

It increases monotonically, just like entropy It starts low, rises to a maximum at some intermediate time, then falls back to low as equilibrium is reached It decreases monotonically while entropy increases

Chapter 1: Entropy Refresher

Before we can define complexity, we need to be precise about entropy. The paper works with three flavors, all closely related.

Boltzmann entropy. You have a microstate x_a (the exact position of every particle). You group microstates into macrostates X_A (coarser descriptions). The Boltzmann entropy of a microstate is the log of how many microstates share its macrostate:

S_Boltzmann(x_a) = k_B log W_A

where W_A is the number of microstates in macrostate X_A. A fully mixed coffee cup has an enormous W_A — there are vastly more ways to arrange particles into a "uniform tan" mixture than into "cream on top, coffee on bottom" — so its Boltzmann entropy is high. The macrostate "fully mixed" is compatible with an astronomical number of microstates. The macrostate "perfectly separated" is compatible with very few.

Shannon (Gibbs) entropy. Given a probability distribution D = (p_x), the Shannon entropy measures how many random bits you need to sample from it:

H(D) = − ∑_x p_x log p_x

This is mathematically identical to the Gibbs entropy from physics. A uniform distribution (maximum ignorance) has maximum Shannon entropy. A delta spike on one outcome (certainty) has zero. A fair coin: H = 1 bit. A biased coin (99% heads): H ≈ 0.08 bits. The more predictable the outcome, the lower the entropy.

Kolmogorov complexity. Given a string x, K(x) is the length of the shortest computer program that outputs x. A string of a million ones has low K — just write a loop. A random-looking string has high K — you have to hard-code every bit. The crucial fact that makes this definition useful: switching from one programming language to another changes K(x) by at most an additive constant (you can always write a compiler for the other language). So K(x) is a property of the string, not the language.

Key connection: For a computable distribution D over n-bit strings, the Kolmogorov complexity of a typical string sampled from D converges to the Shannon entropy of D. So these three notions of entropy — Boltzmann, Shannon, and Kolmogorov — are really the same idea wearing different hats.

An example. Consider three 16-bit strings. String A: 1111111111111111. K(A) is small — "print 1 sixteen times." String B: 1010110010110100 (random-looking). K(B) is large — you have to spell out every bit. String C: 1111000011110000 (a pattern). K(C) is moderate — "print four 1s then four 0s, twice." Entropy (Kolmogorov complexity) ranks B highest. But our intuition about "interestingness" would rank C highest: it has structure that is neither trivial nor random.

None of these entropy measures capture complexity. A random string has maximum Kolmogorov complexity, yet we would not call it "interesting." A simple repeating pattern has minimum complexity. The interesting, structured objects — the swirling cream tendrils — live in between. We need a measure that is low for both extremes and high only in the middle.

Why can't we simply use Kolmogorov complexity K(x) as our measure of "interestingness"?

Because K(x) is maximized by random strings that have no interesting structure — a complexity measure should be low for both simple and random objects, high only for structured ones Because K(x) cannot be computed Because K(x) is always zero for physical systems

Chapter 2: Complexity ≠ Entropy

The paper surveys four candidate definitions of "complexity" from the literature. Each tries to capture the same intuition: a measure that is low for simple things, low for random things, and high only for things that are structured but not trivial.

1. Apparent complexity. Take an object x. Apply a "smoothing" or "denoising" function f that removes random noise. Then measure the entropy of the smoothed version: H(f(x)). The idea is simple: blur out the noise, keep the structure, then ask how much structure remains.

Apparent complexity = H(f(x))

Advantage: simple, computable. Disadvantage: you have to choose f, which feels arbitrary. Different choices of f could give different answers.

Think of it this way. A photograph of static (random noise) has high entropy but low apparent complexity — after blurring, the noise averages out and you are left with a featureless gray image. A photograph of a fractal coastline has moderate entropy but high apparent complexity — after blurring, the large-scale shape is preserved. A photograph of a blank wall has low entropy and low apparent complexity. This matches our intuition.

2. Sophistication (Koppel, 1987; refined by Gacs, Tromp, Vitanyi). Given a string x, find a "model" S (a set containing x) such that x looks like a generic random element of S. The sophistication is the Kolmogorov complexity of the best such model S:

soph_c(x) = min_S K(S) s.t. K(S) + log|S| ≤ K(x) + c

In words: find the simplest "rule" (the model S) that, combined with some random "data" (specifying x within S), produces x via a near-optimal two-part code. The sophistication is the cost of the rule, not the data.

Advantage: principled, no arbitrary smoothing function. Disadvantage: uncomputable, and — critically — it never becomes large for outputs of short probabilistic programs (like the coffee automaton).

3. Logical depth (Bennett, 1995). The time taken by the shortest near-minimal program that outputs x. Simple strings have short, fast programs. Random strings have long programs but those programs are trivially fast (just print the hard-coded data). Deep strings require short programs that run for a long time — they have been "cooked" by extensive computation.

Think of depth as the "computational history" embedded in x. The digits of pi are deep: a short formula exists, but computing a million digits takes a long time. Random noise is shallow: a short program outputs it in linear time (just print random bits). Bennett argues that depth captures the "amount of computational effort" invested in x.

The objection for our purposes: even short, fast programs can produce visually complex patterns. Conway's Game of Life generates intricate structures from trivial rules in a handful of steps. If visual complexity does not require long computation, then logical depth is measuring the wrong thing for the coffee automaton.

4. Light-cone complexity. The mutual information between a point's past and future light cones in a causal system: LCC(a) = I(V_P(a) : V_F(a)). High when the past predicts the future in a nontrivial way. Low when the system is trivial (small light cones, little information) or fully random (past and future are uncorrelated).

The synthesis: These four notions look wildly different, but they are related. Apparent complexity is a resource-bounded version of sophistication (with f defining the model S_f,x = {y : f(y) = f(x)}). Coarse sophistication equals Busy Beaver depth up to O(log n), as shown by Antunes and Fortnow. Light-cone complexity correlates with apparent complexity when contingent structures exist — structures that are useful for prediction but could have been different in a different run.

The authors chose apparent complexity for their experiments — it is the only one they could actually compute. Sophistication and logical depth are uncomputable (or have no known efficient approximation). Light-cone complexity requires estimating mutual information over exponentially large light cones. Apparent complexity, by contrast, just needs a smoothing function and a compression algorithm. Both are readily available.

This is a recurring theme in algorithmic information theory: the most theoretically elegant definitions (sophistication, logical depth) are the hardest to compute, while the more "ad hoc" definitions (apparent complexity) are practically tractable. The paper bridges this gap. It takes the tractable definition, applies it to a concrete physical system, and shows that the result matches our intuition about complexity. That is more than anyone had done before.

Why is sophistication unsuitable for measuring the complexity of the coffee automaton?

Because the coffee automaton is generated by a short probabilistic program, and sophistication never becomes large for outputs of such programs Because sophistication requires continuous data Because sophistication only applies to quantum systems

Chapter 3: The Coffee Automaton

The model is beautifully simple. Take an n × n grid of cells. Each cell is either cream (1) or coffee (0). The initial state: cream fills the top half, coffee fills the bottom half. A clean horizontal boundary.

This is a stochastic cellular automaton. At each discrete time step, the grid evolves according to a random transition rule. The system is closed: no particles enter or leave. The total number of cream cells and coffee cells is conserved. This conservation is crucial — it mirrors the conservation laws of real physics (energy, particle number) and ensures that the system approaches equilibrium rather than some absorbing state.

The initial state has perfect vertical symmetry: every row in the top half is all-cream, every row in the bottom half is all-coffee. The system also has horizontal symmetry (left-right). As mixing proceeds, the vertical symmetry is broken (cream penetrates downward, coffee upward), but horizontal symmetry is preserved in expectation.

The paper considers two transition rules:

Interacting Model

At each step, pick one pair of adjacent, differing cells. Swap them. Particles cannot overlap — the presence of cream in a cell blocks other cream from entering.

Non-Interacting Model

At each step, every cream particle independently moves one step in a random direction. Multiple cream particles can occupy the same cell. Coffee is just background.

The non-interacting model is easier to analyze theoretically because each particle does an independent random walk. The interacting model is more physically realistic — two particles cannot be in the same place at the same time. This mirrors the Pauli exclusion principle for fermions, or more simply, the everyday fact that two lumps of cream cannot occupy the same spot.

The interacting model evolves much more slowly than the non-interacting one. In the interacting model, only one pair of particles swaps per step, and only if they differ (cream next to coffee). Particles far from the boundary cannot move at all until the boundary has reached them. In the non-interacting model, every cream particle moves simultaneously, so the entire system evolves in parallel. This speed difference has consequences for the entropy curves: the interacting model shows a smooth, gradual entropy increase, while the non-interacting model shows a sudden jump after the first step.

Why two models? The paper proves analytically that complexity never becomes large in the non-interacting model. The expected coarse-grained state depends only on vertical position (by symmetry each row has the same expectation), so it can always be described with O(log n) bits. The interacting model is where the interesting behavior lives.

The interacting model produces tendrils and filaments as cream diffuses into coffee, because particles push each other around. When cream particle A wants to move right, it can only do so if the cell to its right contains coffee. This exclusion constraint couples the particles' motions. The boundary between cream and coffee develops a jagged, fractal-looking structure. This is where complexity peaks.

The non-interacting model, by contrast, produces a smooth gradient. Each particle does its own random walk, and when you average over many particles, the law of large numbers takes over. The coarse-grained view is always a smooth function of vertical position — boring at every time step. There are no tendrils, no filaments, no patches. Just a diffusing front.

Why does the interacting model generate more complexity? Because interactions break the independence that makes the non-interacting model smooth. In the non-interacting case, any two cream particles in the same row have statistically interchangeable trajectories. In the interacting case, a cream particle near the boundary has a very different fate from one far away: it participates in the mixing interface, where the complex structures form.

Here is a concrete way to see the difference. In the non-interacting model, the cream density at position (x, y) after t steps is determined (in expectation) solely by y and t. The x-coordinate is irrelevant because of horizontal symmetry. So the expected coarse-grained state has O(n) degrees of freedom — one per row — and each is deterministic given t. Total description: O(log n) bits.

In the interacting model, the exclusion constraint creates correlations between particles in the same row. The cream density at (x, y) depends not just on y but on the specific history of swaps that reached this row. Different runs produce wildly different boundary shapes. This contingency — the "could have been different" quality of the state — is exactly what complexity measures.

Note that both models reach the same equilibrium: complete mixing. The second law guarantees this. The difference is in what happens along the way. The interacting model takes a complex path through state space; the non-interacting model takes a simple one. Both end up at the same destination. The journey matters for complexity; the destination matters for entropy.

In the interacting model, what happens at each time step?

All particles move simultaneously in random directions One pair of adjacent, differing particles (one cream, one coffee) is selected and their positions are swapped Each particle moves toward the center of the grid

Chapter 4: Coarse-Graining

We want to measure apparent complexity: smooth out the noise, then measure what remains. The smoothing function f is coarse-graining.

Here is how it works. Take the n × n grid of 0s and 1s (the fine-grained state). For each cell, compute the average value of all cells within a g × g square centered on it. This gives a floating-point array of local cream densities. The value g is the grain size.

Fine-Grained

Individual cells: 0 or 1. Every particle tracked. Entropy = compressed size of this.

↓ average over g × g neighborhoods

Coarse-Grained

Local averages: values between 0 and 1. Then thresholded into discrete buckets. Complexity = compressed size of this.

The floating-point averages are then thresholded into discrete buckets. In the first experiment, three buckets: "mostly coffee" (near 0), "mixed" (near 0.5), and "mostly cream" (near 1). In the refined experiment, seven buckets with an adjustment algorithm that removes border artifacts.

Why threshold at all? Because compression algorithms work on discrete symbols, not floating-point numbers. The thresholding converts the continuous average into a small alphabet. A cell with average 0.12 becomes "mostly coffee." A cell with average 0.87 becomes "mostly cream." A cell with average 0.48 becomes "mixed." This is the paper's version of the smoothing function f: average over a neighborhood, then quantize into a small number of symbols.

The grain size g is a free parameter. Too small (g = 1) and the coarse-grained state is identical to the fine-grained state — no smoothing happens. Too large (g = n) and the entire grid is averaged into a single number — everything is over-smoothed. The paper experiments with several grain sizes and finds that the qualitative behavior is robust across reasonable choices.

The compressed file size (using gzip) of the coarse-grained, thresholded array approximates the Kolmogorov complexity K(S) of the macroscopic state — i.e., the apparent complexity. The compressed size of the fine-grained array approximates the total entropy K(x).

Think of it this way. The fine-grained array is the microstate: every individual particle tracked. The coarse-grained array is the macrostate: the large-scale pattern visible from a distance. The complexity of the macrostate is what we want. And gzip's output size, while not a perfect estimator of Kolmogorov complexity, is a consistently good upper bound that tracks the true value's trends faithfully.

Why gzip? Kolmogorov complexity K(x) is uncomputable — you cannot write a program to compute the shortest program for x. But compression algorithms like gzip give an upper bound on K(x). If gzip can compress a string to 200 bytes, then K(x) ≤ 200 (plus a constant for the decompressor). The tighter the compression, the better the bound. Crucially, gzip is consistent: it does not randomly produce good compression on one run and bad compression on another. Its output size tracks real patterns in the data.

One technical objection, raised by Shalizi, is that general-purpose compression programs do not provide accurate entropy estimates for all data types. The authors respond that in their experiments, the estimates are stable and consistent across runs. More importantly, the qualitative behavior — entropy rising, complexity rising then falling — does not depend on the specific compressor used.

The paper also describes an "adjusted" coarse-graining that uses seven thresholds and a majority-vote algorithm to flatten border artifacts. Imagine a row of the coarse-grained array where the true average hovers around 0.5. With three thresholds, some cells land in the "mixed" bucket and others in the "mostly cream" bucket, depending on tiny fluctuations. This creates apparent complexity where there is none — a border artifact.

The fix: allow each cell to be nudged up or down by one threshold level. Use a majority algorithm within each row: if most cells in a row say "mixed," adjust the outliers to match. This removes the noise while preserving genuine spatial structure (which would create multi-threshold differences that cannot be flattened by a one-step adjustment).

The result: the non-interacting model's complexity drops to near zero at all times, while the interacting model still shows the characteristic rise-and-fall. This is strong evidence that the interacting model's complexity is genuine, not an artifact of the measurement.

There is another metric the paper explored but ultimately found unsatisfactory: the Optimal Symbol Compression Ratio (OSCR) algorithm by Evans et al. This algorithm directly estimates both Kolmogorov complexity and sophistication via a two-part code (codebook + encoded data). In principle, it is more principled than gzip. In practice, the authors found that their implementation produced noisy results with no clear trends. They conjecture that OSCR fails here because it does not account for the two-dimensionality of the automaton state — it treats the grid as a one-dimensional string.

What does the compressed file size of the coarse-grained array approximate?

The total number of particles in the system The temperature of the system The apparent complexity — the Kolmogorov complexity of the macroscopic (smoothed) state, which captures only the non-random, structured information

Chapter 5: Apparent Complexity

Let us be precise about what apparent complexity measures and why it makes sense for the coffee automaton.

Given object x, smoothing function f, and entropy measure H:

Apparent complexity of x = H(f(x))

In our case: x is the n × n fine-grained grid (each cell is 0 or 1). f is the coarse-graining + thresholding procedure from the previous chapter. H is the Kolmogorov complexity, approximated by gzip compressed file size.

Why does this give the right answer at each stage of mixing?

Time	Fine-grained state	Coarse-grained state	Apparent complexity
t = 0	Cream top, coffee bottom. Ordered.	All-white top, all-black bottom.	Low — trivially compressible.
Intermediate	Tendrils, filaments, patches.	Complex mosaic of thresholds.	High — hard to compress the pattern.
t → ∞	Random-looking uniform mixture.	Uniform "mixed" threshold everywhere.	Low — trivially compressible again.

The fine-grained entropy (total K(x)) increases monotonically. It starts low (ordered) and ends high (random). But the coarse-grained complexity rises and falls. This is the signature the paper is looking for.

Why does complexity fall at the end? Because the coarse-grained state converges to a uniform "mixed" value everywhere. If every cell in the coarse-grained array has the same threshold value, gzip compresses it to nearly zero bytes — just "repeat this symbol N² times." The fine-grained state is still random (high entropy), but the smoothing has erased all the randomness, leaving nothing structured behind. The random details canceled out, just as noise cancels in any averaging process.

This is the heart of the paper's message. The entropy of the fine-grained state measures all information, including random noise. The complexity of the coarse-grained state measures only the structured information — the information that survives smoothing. At the start, there is structure but little randomness: low entropy, low complexity. In the middle, there is both structure and randomness: rising entropy, peak complexity. At the end, there is abundant randomness but no surviving structure: high entropy, low complexity. The two quantities tell fundamentally different stories about the same system.

The arbitrary-f objection. One might worry that the choice of smoothing function f is arbitrary. The authors counter this by analogy to Boltzmann entropy: defining S_Boltzmann also requires choosing a coarse-graining of phase space, and that choice is motivated by physics (interactions are local in position space). Similarly, coarse-graining the coffee grid over local spatial neighborhoods is physically natural — it is what our eyes do when we look at a coffee cup from a distance.

There is a beautiful connection to sophistication here. Apparent complexity is a resource-bounded form of sophistication. The model S for sophistication is the set S_f,x = {y : f(y) = f(x)} — all microstates that look the same after smoothing. The complexity of describing S_f,x equals the complexity of describing f(x). So apparent complexity is just sophistication where we have fixed the model-finding procedure to be "apply f."

This also explains why the choice of f matters less than you might fear. Any "natural" smoothing function — one that averages over local spatial neighborhoods — will produce similar results. Just as Boltzmann entropy is robust to reasonable choices of coarse-graining, apparent complexity is robust to reasonable choices of smoothing. You would get a strange answer only if you chose an unphysical f, like averaging over random non-contiguous cells.

The paper also connects this to light-cone complexity. The coarse-graining regions (g × g squares of contiguous cells) correspond roughly to the causal structure of the automaton. A particle at position (x, y) can only influence cells within a diamond-shaped region around it. By coarse-graining over local squares, we are effectively looking at the information that survives within a causal neighborhood — the same kind of information that light-cone complexity examines.

At what stage of coffee mixing is the apparent complexity highest?

Right at the beginning, when cream and coffee are fully separated At an intermediate time, when tendrils and filaments create structured but non-trivial patterns in the coarse-grained view At the end, when the liquids are fully mixed

Chapter 6: Other Measures

Let us go deeper into the three alternatives the paper considers but ultimately does not use for experiments.

Sophistication tries to separate the "non-random" part of a description from the "random" part. Given x, find a model S (a set containing x) such that:

K(S) + log|S| ≤ K(x) + c

and x is a "generic" element of S (knowing S leaves K(x|S) ≥ log|S| − c bits of uncertainty). The sophistication is the minimum K(S) over all such models.

The critical problem: for any short probabilistic program P (like the coffee automaton), the output x has low sophistication with overwhelming probability. Why? Take S = {all outputs y of P with Pr[y] ≈ Pr[x]}. This set S contains x, it takes only O(log n) bits to describe (the program P plus a probability threshold), and x is a generic element of S (since the probability of x within S is roughly uniform). So sophistication is at most O(log n) — never large — for the coffee automaton.

This is a deep limitation. It means sophistication, despite being the most theoretically principled complexity measure, is useless for studying physical systems that evolve by short, simple rules. The coffee automaton has simple rules and a simple initial state. Sophistication sees nothing interesting in its output. But our eyes see swirling cream tendrils. We need a measure that captures what we see.

Resource-bounded sophistication might save the day. The argument above fails if we constrain the computational resources available for finding S and for specifying x given S. If we require the program for S to run in polynomial time, the easy model S = {outputs of P with similar probability} may no longer be efficiently enumerable. Resource-bounded sophistication might then become large for the coffee automaton. The paper suggests this as a direction but does not pursue it.

Logical depth measures the running time of the shortest near-minimal program for x. As we discussed in Chapter 2, simple strings have short, fast programs, while random strings have long but trivially fast programs. "Deep" strings require short programs that take a very long time.

For the coffee automaton specifically, the objection is sharp: the automaton's transition rule is simple and fast. Generating the state at time t requires only t random choices. The state might look complex to our eyes, but its logical depth is modest. Running time is not measuring what we want.

Light-cone complexity (Shalizi et al.) measures how much a point's past predicts its future:

LCC(a) = I(V_P(a) : V_F(a)) = H(V_P(a)) + H(V_F(a)) − H(V_P(a), V_F(a))

When dynamics are too simple, both H terms are small. When dynamics are too random, the past and future are uncorrelated. Only in the intermediate regime is LCC large. But it has a flaw: in slowly-evolving systems after equilibrium, LCC stays large (the past light cone contains almost the same random data as the future one), giving a false positive.

What does logical depth measure?

The size of the compressed representation of x The mutual information between past and future The running time of the shortest near-minimal program that outputs x — how much computational work was needed to produce x

Chapter 7: Showcase — Coffee Mixing

Watch cream mix into coffee and see entropy and complexity diverge in real time. The left panel shows the 40 × 40 grid: light cells are cream, dark cells are coffee. The right panel plots two curves as the simulation runs. The teal curve is entropy (a proxy based on the number of unlike-neighbor transitions in the fine-grained grid). The orange curve is complexity (the same transition count applied to the coarse-grained grid).

Click Start to begin. Try the interacting model first: watch the boundary between cream and coffee develop jagged tendrils. Then reset and switch to non-interacting: the boundary stays smooth as particles diffuse independently. The difference in the complexity curves is striking.

Toggle between interacting (particles swap — exclusion enforced) and non-interacting (particles walk independently — multiple per cell allowed). Use the speed slider to control how many swaps happen per animation frame.

Speed: Ready

What to watch for: In the interacting model, the complexity curve (orange) rises as tendrils form, peaks when the boundary is maximally jagged, then falls as the mixture becomes uniform. Entropy (teal) only rises. In the non-interacting model, complexity barely rises at all — the smooth gradient is always easy to describe. The gap between the two models is the paper's central finding.

Simulation details. This simplified simulation uses transition counts (unlike-neighbor pairs) as a proxy for gzip compressed size. In the actual paper, states are serialized and compressed with gzip/bzip2/lzma. The qualitative behavior — entropy rising, complexity rising then falling — is the same. The interacting model swaps one random adjacent differing pair per "step" (scaled by the speed slider). The non-interacting model moves all cream particles simultaneously each step.

Why does the non-interacting model produce almost no complexity?

Because each particle does an independent random walk, so the coarse-grained state (averaged over many particles) is a smooth gradient that depends only on vertical position — always describable in O(log n) bits Because the particles do not move Because the grid is too small

Chapter 8: Results & Scaling

The paper runs the coffee automaton at multiple grid sizes and measures how the key quantities scale with n (the side length of the grid).

Quantity	Scaling with n	Why
Maximum entropy	O(n²)	Proportional to the total number of particles. A fully random n × n grid needs n² bits.
Maximum complexity	O(n)	Complexity develops along the 1D boundary between cream and coffee. The boundary's horizontal extent is proportional to n.
Time to peak complexity	O(n²)	Proportional to the number of particles — all particles need to participate in the mixing.

Linear complexity scaling is the key result. Maximum complexity grows as O(n), not O(n²). This means the complexity is fundamentally a boundary phenomenon. The interesting structure lives at the interface between the two liquids, and that interface is one-dimensional even though the system is two-dimensional. If you doubled the grid width but kept the initial boundary in the same place, you would roughly double the peak complexity.

The quadratic scaling of both max entropy (O(n²)) and time to peak complexity (O(n²)) makes physical sense. Max entropy is proportional to the number of particles, which grows as the area. The time to peak complexity is proportional to how long it takes n² particles to rearrange themselves into the maximally complex configuration — each swap moves one pair, so O(n²) swaps are needed.

These results are robust across compression algorithms. The paper tested gzip, bzip2, and lzma on the same simulation data. All three produce qualitatively identical curves: entropy increases monotonically, complexity rises and falls. The specific file sizes differ (lzma compresses tighter than gzip), but the shape is the same. This robustness is important: it suggests that the phenomenon is real, not an artifact of a particular compression algorithm's biases.

Compressor	Entropy trend	Complexity trend	Peak matches?
gzip	Monotonically increasing	Rise then fall	Yes
bzip2	Monotonically increasing	Rise then fall	Yes
lzma	Monotonically increasing	Rise then fall	Yes

The non-interacting case — a theorem. For the non-interacting automaton with periodic boundary conditions, the paper proves that the Kolmogorov complexity of the coarse-grained state is O(log n) at all times, provided the grain size L satisfies L > Θ(G √(log n)). The proof sketch goes like this:

Let a_t(B) be the number of cream particles in an L × L block B at time t. Since particles do independent random walks, a_t(B) is a sum of independent 0/1 random variables. By a Chernoff bound:

Pr[|a_t(B) − E[a_t(B)]| > L²/G] < 2 exp(−L² / 3G²)

If L is large enough (specifically L > G √(3 ln(2n²))), then the probability that any block deviates from its expectation is less than 1/n² by a union bound. And since the expectation depends only on the vertical position (by horizontal symmetry), the coarse-grained state is determined by at most n values, each of which is predictable from n and t alone. Total Kolmogorov complexity: O(log n).

Why O(log n)? Once you know n and t, you can compute the expected cream density at each row. The coarse-grained state matches this prediction (up to adjustments). So the entire description is: "here is the grid size, here is the time step." That takes O(log n + log t) = O(log n) bits.

This means the adjusted coarse-graining correctly produces flat complexity for the non-interacting model. The small bump seen in the basic coarse-graining was a thresholding artifact, not genuine complexity.

The adjusted coarse-graining. The paper refines the basic method by using seven thresholds (instead of three) and a majority-vote algorithm. For each row, if a cell's threshold is within one step of the row majority, it gets flattened to the majority. This kills the border artifacts — rows that fluctuate between two nearby thresholds get smoothed to a single value. The result: the non-interacting model's complexity drops to near zero, while the interacting model's complexity curve is preserved.

Open problem. The paper proves analytically that complexity stays low in the non-interacting case. But for the interacting case, they only have numerical evidence that complexity reaches O(n). Proving a lower bound on the complexity of the interacting automaton at intermediate times remains open. This is hard because a lower bound on Kolmogorov complexity means showing that no program of fewer than Ω(n) bits can describe the state — a claim about all possible compression algorithms. Upper bounds are easy (just exhibit a compression). Lower bounds require ruling out every possible compression.

Why might the interacting model achieve O(n) complexity? The authors offer an intuitive argument: the automaton begins with a horizontal boundary between cream and coffee. As mixing proceeds, this boundary becomes increasingly jagged. The boundary has horizontal extent n. The complexity of describing a random jagged boundary of length n should be Θ(n). After sufficient mixing, the boundary dissolves and the coarse-grained state becomes uniform again. So the complexity rises to O(n) and then falls. But formalizing "random jagged boundary" into a rigorous proof remains elusive.

How does the maximum complexity of the interacting coffee automaton scale with grid side length n?

O(n²) — proportional to the total area O(n) — linear in the side length, because complexity is a boundary phenomenon O(log n) — logarithmic, the same as the non-interacting case

Chapter 9: Connections

Thermodynamic Arrow of Time

Entropy increases monotonically. This paper suggests a parallel: a "thermodynamic arrow of complexity" that rises and falls. Both are consequences of starting from a low-entropy initial state.

↓

Cosmological Complexity

The universe: low complexity at the Big Bang, high complexity now (galaxies, life), low complexity at heat death. This paper's coffee automaton is a toy model of that trajectory.

↓

Algorithmic Information Theory

The paper connects Kolmogorov complexity, sophistication, logical depth, and apparent complexity into a unified picture. These had been studied separately since the 1980s.

The Minimum Description Length principle. Coarse-graining the coffee automaton is closely related to two-part codes in MDL: the model S (coarse-grained state) plus the data given the model (fine-grained residual). The sophistication K(S) is the MDL model cost. This connects the paper to practical machine learning, where MDL is used for model selection. In MDL, you want a model that is neither too simple (underfitting) nor too complex (overfitting). The optimal model has intermediate complexity. The coffee automaton at its peak complexity is, in a sense, the physical analog of this optimal model — the state that is hardest to compress at the macro scale.

Renormalization group. Coarse-graining in physics is formalized by the renormalization group: integrating out small-scale degrees of freedom to study large-scale behavior. The paper's smoothing function f is a simple form of real-space renormalization. When you average over a g × g neighborhood, you are integrating out fluctuations at scales smaller than g. The "complexity" at each scale is determined by the structure that survives this integration.

Cellular automata and emergent computation. The coffee automaton joins a lineage that includes Conway's Game of Life and Wolfram's elementary automata. But the focus here is different. Wolfram studied whether simple rules could produce complex dynamics. Aaronson, Carroll, and Ouellette study whether those dynamics produce complex states at intermediate times — and whether the complexity eventually disappears. The answer is yes, at least for the interacting model.

Biological complexity. Living organisms are the most complex structures in the known universe, and they exist precisely in the thermodynamic middle ground: far from equilibrium, sustained by energy flows. The coffee automaton does not model life, but it captures the same principle in miniature. Complexity requires a system that has left its initial ordered state but has not yet reached equilibrium. Life is the universe's version of the swirling cream tendrils.

Connection to deep learning. The idea that complexity lives between order and randomness appears in neural network theory as the "edge of chaos" hypothesis: networks trained at the boundary between ordered and chaotic dynamics learn the most useful representations. The coffee automaton gives a clean, information-theoretic version of this intuition.

Why Ilya Sutskever included this paper. This paper is one of Sutskever's "30 under 30" recommended readings. It connects computational complexity, statistical mechanics, and information theory in a single, concrete model. For someone thinking about intelligence and complexity, the core message is powerful: interesting structure is transient. It arises between order and randomness, and it eventually disappears. Understanding when and why complexity peaks may be essential to understanding both the physical universe and the learned representations inside neural networks.

Authors. Scott Aaronson is a theoretical computer scientist known for quantum computing complexity theory. Sean M. Carroll is a theoretical physicist known for work on cosmology, entropy, and the arrow of time. Lauren Ouellette was a student at MIT when the work was done. The paper reflects the unique combination of computational complexity thinking (Aaronson), physical intuition about entropy and the arrow of time (Carroll), and careful implementation (Ouellette).

What remains open. The big unsolved problem: prove that the interacting coffee automaton actually achieves complexity Ω(n) at intermediate times. The paper only has numerical evidence and an upper bound. A lower bound would require showing that no compression algorithm can describe the coarse-grained state in fewer than Ω(n) bits — a statement about all possible programs, which is fundamentally harder than exhibiting a specific compression. A second open direction: extend the framework to systems with conservation laws, phase transitions, or three spatial dimensions.

The lasting insight. This is the first quantitative exploration of the "rise and fall of complexity" in closed systems. The idea had been discussed informally by Carroll, Gell-Mann, and others, but never measured. The coffee automaton gives a concrete, reproducible model where the phenomenon can be seen and studied. The framework — coarse-grain, compress, plot — is now a standard tool for studying emergent complexity.

Paper details. "Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton," Scott Aaronson, Sean M. Carroll, Lauren Ouellette. arXiv:1405.6903, 2014. Source code for the original automaton is available at scottaaronson.com.

Key references from the paper. Bennett (1995) on logical depth. Gacs, Tromp, Vitanyi (2001) on algorithmic statistics. Shalizi, Shalizi, Haslinger (2004) on light-cone complexity. Antunes and Fortnow (2009) on the equivalence of coarse sophistication and Busy Beaver depth. Carroll (2010), From Eternity to Here, for the cosmological context.

← Back to Veanors Hub

What is the main open problem left by this paper?

Whether entropy increases monotonically Whether coarse-graining is a valid technique Proving analytically that the interacting coffee automaton achieves complexity Ω(n) at intermediate times — a lower bound, not just numerical evidence

The Rise and Fall ofComplexity