Entropy only goes up. But complexity — the "interestingness" of a system — rises and then falls. This paper builds a toy universe (cream mixing into coffee) and measures it.
Imagine you pour cream into a cup of black coffee. At the very start, the system is simple to describe: "cream on top, coffee on bottom." After a long time, it is equally simple: "uniform tan mixture everywhere." But in between — when tendrils of cream are swirling through dark coffee, when there are patches and whorls and filaments — the system looks complex.
Here is what is strange. The second law of thermodynamics says entropy — a measure of disorder — increases monotonically in a closed system. It starts low and ends high. Period. No exceptions. But complexity does something different. It starts low, rises, and then falls back to low. Entropy and complexity are not the same thing, and they follow different trajectories.
This is not just a question about coffee. The universe itself follows the same pattern. Near the Big Bang: smooth, hot plasma. Low entropy, low complexity. In 10100 years: dispersed particles, heat death. High entropy, low complexity. But right now, between those endpoints, we have galaxies, stars, planets, and life. Maximum complexity at an intermediate time.
Physicists have talked about this informally for decades. Sean Carroll discussed it in From Eternity to Here. Murray Gell-Mann discussed it in The Quark and the Jaguar. But nobody had actually measured it. Nobody had built a toy system, defined a formal complexity measure, run the simulation, and plotted the curve.
Aaronson, Carroll, and Ouellette set out to do exactly that. They build the simplest possible model — a 2D cellular automaton of cream and coffee — and test whether they can see complexity rise and then fall as entropy increases. They can. And the result is surprisingly clean.
Along the way, they survey four different definitions of "complexity" from the algorithmic information theory and complex systems literature, show how they relate to each other, and identify which one is actually computable for their model system. The paper is part physics, part computer science, and part philosophy of what "interestingness" means.
Before we can define complexity, we need to be precise about entropy. The paper works with three flavors, all closely related.
Boltzmann entropy. You have a microstate xa (the exact position of every particle). You group microstates into macrostates XA (coarser descriptions). The Boltzmann entropy of a microstate is the log of how many microstates share its macrostate:
where WA is the number of microstates in macrostate XA. A fully mixed coffee cup has an enormous WA — there are vastly more ways to arrange particles into a "uniform tan" mixture than into "cream on top, coffee on bottom" — so its Boltzmann entropy is high. The macrostate "fully mixed" is compatible with an astronomical number of microstates. The macrostate "perfectly separated" is compatible with very few.
Shannon (Gibbs) entropy. Given a probability distribution D = (px), the Shannon entropy measures how many random bits you need to sample from it:
This is mathematically identical to the Gibbs entropy from physics. A uniform distribution (maximum ignorance) has maximum Shannon entropy. A delta spike on one outcome (certainty) has zero. A fair coin: H = 1 bit. A biased coin (99% heads): H ≈ 0.08 bits. The more predictable the outcome, the lower the entropy.
Kolmogorov complexity. Given a string x, K(x) is the length of the shortest computer program that outputs x. A string of a million ones has low K — just write a loop. A random-looking string has high K — you have to hard-code every bit. The crucial fact that makes this definition useful: switching from one programming language to another changes K(x) by at most an additive constant (you can always write a compiler for the other language). So K(x) is a property of the string, not the language.
1111111111111111. K(A) is small — "print 1 sixteen times." String B: 1010110010110100 (random-looking). K(B) is large — you have to spell out every bit. String C: 1111000011110000 (a pattern). K(C) is moderate — "print four 1s then four 0s, twice." Entropy (Kolmogorov complexity) ranks B highest. But our intuition about "interestingness" would rank C highest: it has structure that is neither trivial nor random.None of these entropy measures capture complexity. A random string has maximum Kolmogorov complexity, yet we would not call it "interesting." A simple repeating pattern has minimum complexity. The interesting, structured objects — the swirling cream tendrils — live in between. We need a measure that is low for both extremes and high only in the middle.
The paper surveys four candidate definitions of "complexity" from the literature. Each tries to capture the same intuition: a measure that is low for simple things, low for random things, and high only for things that are structured but not trivial.
1. Apparent complexity. Take an object x. Apply a "smoothing" or "denoising" function f that removes random noise. Then measure the entropy of the smoothed version: H(f(x)). The idea is simple: blur out the noise, keep the structure, then ask how much structure remains.
Advantage: simple, computable. Disadvantage: you have to choose f, which feels arbitrary. Different choices of f could give different answers.
Think of it this way. A photograph of static (random noise) has high entropy but low apparent complexity — after blurring, the noise averages out and you are left with a featureless gray image. A photograph of a fractal coastline has moderate entropy but high apparent complexity — after blurring, the large-scale shape is preserved. A photograph of a blank wall has low entropy and low apparent complexity. This matches our intuition.
2. Sophistication (Koppel, 1987; refined by Gacs, Tromp, Vitanyi). Given a string x, find a "model" S (a set containing x) such that x looks like a generic random element of S. The sophistication is the Kolmogorov complexity of the best such model S:
In words: find the simplest "rule" (the model S) that, combined with some random "data" (specifying x within S), produces x via a near-optimal two-part code. The sophistication is the cost of the rule, not the data.
Advantage: principled, no arbitrary smoothing function. Disadvantage: uncomputable, and — critically — it never becomes large for outputs of short probabilistic programs (like the coffee automaton).
3. Logical depth (Bennett, 1995). The time taken by the shortest near-minimal program that outputs x. Simple strings have short, fast programs. Random strings have long programs but those programs are trivially fast (just print the hard-coded data). Deep strings require short programs that run for a long time — they have been "cooked" by extensive computation.
Think of depth as the "computational history" embedded in x. The digits of pi are deep: a short formula exists, but computing a million digits takes a long time. Random noise is shallow: a short program outputs it in linear time (just print random bits). Bennett argues that depth captures the "amount of computational effort" invested in x.
The objection for our purposes: even short, fast programs can produce visually complex patterns. Conway's Game of Life generates intricate structures from trivial rules in a handful of steps. If visual complexity does not require long computation, then logical depth is measuring the wrong thing for the coffee automaton.
4. Light-cone complexity. The mutual information between a point's past and future light cones in a causal system: LCC(a) = I(VP(a) : VF(a)). High when the past predicts the future in a nontrivial way. Low when the system is trivial (small light cones, little information) or fully random (past and future are uncorrelated).
The authors chose apparent complexity for their experiments — it is the only one they could actually compute. Sophistication and logical depth are uncomputable (or have no known efficient approximation). Light-cone complexity requires estimating mutual information over exponentially large light cones. Apparent complexity, by contrast, just needs a smoothing function and a compression algorithm. Both are readily available.
This is a recurring theme in algorithmic information theory: the most theoretically elegant definitions (sophistication, logical depth) are the hardest to compute, while the more "ad hoc" definitions (apparent complexity) are practically tractable. The paper bridges this gap. It takes the tractable definition, applies it to a concrete physical system, and shows that the result matches our intuition about complexity. That is more than anyone had done before.
The model is beautifully simple. Take an n × n grid of cells. Each cell is either cream (1) or coffee (0). The initial state: cream fills the top half, coffee fills the bottom half. A clean horizontal boundary.
This is a stochastic cellular automaton. At each discrete time step, the grid evolves according to a random transition rule. The system is closed: no particles enter or leave. The total number of cream cells and coffee cells is conserved. This conservation is crucial — it mirrors the conservation laws of real physics (energy, particle number) and ensures that the system approaches equilibrium rather than some absorbing state.
The initial state has perfect vertical symmetry: every row in the top half is all-cream, every row in the bottom half is all-coffee. The system also has horizontal symmetry (left-right). As mixing proceeds, the vertical symmetry is broken (cream penetrates downward, coffee upward), but horizontal symmetry is preserved in expectation.
The paper considers two transition rules:
The non-interacting model is easier to analyze theoretically because each particle does an independent random walk. The interacting model is more physically realistic — two particles cannot be in the same place at the same time. This mirrors the Pauli exclusion principle for fermions, or more simply, the everyday fact that two lumps of cream cannot occupy the same spot.
The interacting model evolves much more slowly than the non-interacting one. In the interacting model, only one pair of particles swaps per step, and only if they differ (cream next to coffee). Particles far from the boundary cannot move at all until the boundary has reached them. In the non-interacting model, every cream particle moves simultaneously, so the entire system evolves in parallel. This speed difference has consequences for the entropy curves: the interacting model shows a smooth, gradual entropy increase, while the non-interacting model shows a sudden jump after the first step.
The interacting model produces tendrils and filaments as cream diffuses into coffee, because particles push each other around. When cream particle A wants to move right, it can only do so if the cell to its right contains coffee. This exclusion constraint couples the particles' motions. The boundary between cream and coffee develops a jagged, fractal-looking structure. This is where complexity peaks.
The non-interacting model, by contrast, produces a smooth gradient. Each particle does its own random walk, and when you average over many particles, the law of large numbers takes over. The coarse-grained view is always a smooth function of vertical position — boring at every time step. There are no tendrils, no filaments, no patches. Just a diffusing front.
Why does the interacting model generate more complexity? Because interactions break the independence that makes the non-interacting model smooth. In the non-interacting case, any two cream particles in the same row have statistically interchangeable trajectories. In the interacting case, a cream particle near the boundary has a very different fate from one far away: it participates in the mixing interface, where the complex structures form.
Here is a concrete way to see the difference. In the non-interacting model, the cream density at position (x, y) after t steps is determined (in expectation) solely by y and t. The x-coordinate is irrelevant because of horizontal symmetry. So the expected coarse-grained state has O(n) degrees of freedom — one per row — and each is deterministic given t. Total description: O(log n) bits.
In the interacting model, the exclusion constraint creates correlations between particles in the same row. The cream density at (x, y) depends not just on y but on the specific history of swaps that reached this row. Different runs produce wildly different boundary shapes. This contingency — the "could have been different" quality of the state — is exactly what complexity measures.
Note that both models reach the same equilibrium: complete mixing. The second law guarantees this. The difference is in what happens along the way. The interacting model takes a complex path through state space; the non-interacting model takes a simple one. Both end up at the same destination. The journey matters for complexity; the destination matters for entropy.
We want to measure apparent complexity: smooth out the noise, then measure what remains. The smoothing function f is coarse-graining.
Here is how it works. Take the n × n grid of 0s and 1s (the fine-grained state). For each cell, compute the average value of all cells within a g × g square centered on it. This gives a floating-point array of local cream densities. The value g is the grain size.
The floating-point averages are then thresholded into discrete buckets. In the first experiment, three buckets: "mostly coffee" (near 0), "mixed" (near 0.5), and "mostly cream" (near 1). In the refined experiment, seven buckets with an adjustment algorithm that removes border artifacts.
Why threshold at all? Because compression algorithms work on discrete symbols, not floating-point numbers. The thresholding converts the continuous average into a small alphabet. A cell with average 0.12 becomes "mostly coffee." A cell with average 0.87 becomes "mostly cream." A cell with average 0.48 becomes "mixed." This is the paper's version of the smoothing function f: average over a neighborhood, then quantize into a small number of symbols.
The grain size g is a free parameter. Too small (g = 1) and the coarse-grained state is identical to the fine-grained state — no smoothing happens. Too large (g = n) and the entire grid is averaged into a single number — everything is over-smoothed. The paper experiments with several grain sizes and finds that the qualitative behavior is robust across reasonable choices.
The compressed file size (using gzip) of the coarse-grained, thresholded array approximates the Kolmogorov complexity K(S) of the macroscopic state — i.e., the apparent complexity. The compressed size of the fine-grained array approximates the total entropy K(x).
Think of it this way. The fine-grained array is the microstate: every individual particle tracked. The coarse-grained array is the macrostate: the large-scale pattern visible from a distance. The complexity of the macrostate is what we want. And gzip's output size, while not a perfect estimator of Kolmogorov complexity, is a consistently good upper bound that tracks the true value's trends faithfully.
One technical objection, raised by Shalizi, is that general-purpose compression programs do not provide accurate entropy estimates for all data types. The authors respond that in their experiments, the estimates are stable and consistent across runs. More importantly, the qualitative behavior — entropy rising, complexity rising then falling — does not depend on the specific compressor used.
The paper also describes an "adjusted" coarse-graining that uses seven thresholds and a majority-vote algorithm to flatten border artifacts. Imagine a row of the coarse-grained array where the true average hovers around 0.5. With three thresholds, some cells land in the "mixed" bucket and others in the "mostly cream" bucket, depending on tiny fluctuations. This creates apparent complexity where there is none — a border artifact.
The fix: allow each cell to be nudged up or down by one threshold level. Use a majority algorithm within each row: if most cells in a row say "mixed," adjust the outliers to match. This removes the noise while preserving genuine spatial structure (which would create multi-threshold differences that cannot be flattened by a one-step adjustment).
The result: the non-interacting model's complexity drops to near zero at all times, while the interacting model still shows the characteristic rise-and-fall. This is strong evidence that the interacting model's complexity is genuine, not an artifact of the measurement.
There is another metric the paper explored but ultimately found unsatisfactory: the Optimal Symbol Compression Ratio (OSCR) algorithm by Evans et al. This algorithm directly estimates both Kolmogorov complexity and sophistication via a two-part code (codebook + encoded data). In principle, it is more principled than gzip. In practice, the authors found that their implementation produced noisy results with no clear trends. They conjecture that OSCR fails here because it does not account for the two-dimensionality of the automaton state — it treats the grid as a one-dimensional string.
Let us be precise about what apparent complexity measures and why it makes sense for the coffee automaton.
Given object x, smoothing function f, and entropy measure H:
In our case: x is the n × n fine-grained grid (each cell is 0 or 1). f is the coarse-graining + thresholding procedure from the previous chapter. H is the Kolmogorov complexity, approximated by gzip compressed file size.
Why does this give the right answer at each stage of mixing?
| Time | Fine-grained state | Coarse-grained state | Apparent complexity |
|---|---|---|---|
| t = 0 | Cream top, coffee bottom. Ordered. | All-white top, all-black bottom. | Low — trivially compressible. |
| Intermediate | Tendrils, filaments, patches. | Complex mosaic of thresholds. | High — hard to compress the pattern. |
| t → ∞ | Random-looking uniform mixture. | Uniform "mixed" threshold everywhere. | Low — trivially compressible again. |
The fine-grained entropy (total K(x)) increases monotonically. It starts low (ordered) and ends high (random). But the coarse-grained complexity rises and falls. This is the signature the paper is looking for.
Why does complexity fall at the end? Because the coarse-grained state converges to a uniform "mixed" value everywhere. If every cell in the coarse-grained array has the same threshold value, gzip compresses it to nearly zero bytes — just "repeat this symbol N2 times." The fine-grained state is still random (high entropy), but the smoothing has erased all the randomness, leaving nothing structured behind. The random details canceled out, just as noise cancels in any averaging process.
This is the heart of the paper's message. The entropy of the fine-grained state measures all information, including random noise. The complexity of the coarse-grained state measures only the structured information — the information that survives smoothing. At the start, there is structure but little randomness: low entropy, low complexity. In the middle, there is both structure and randomness: rising entropy, peak complexity. At the end, there is abundant randomness but no surviving structure: high entropy, low complexity. The two quantities tell fundamentally different stories about the same system.
There is a beautiful connection to sophistication here. Apparent complexity is a resource-bounded form of sophistication. The model S for sophistication is the set Sf,x = {y : f(y) = f(x)} — all microstates that look the same after smoothing. The complexity of describing Sf,x equals the complexity of describing f(x). So apparent complexity is just sophistication where we have fixed the model-finding procedure to be "apply f."
This also explains why the choice of f matters less than you might fear. Any "natural" smoothing function — one that averages over local spatial neighborhoods — will produce similar results. Just as Boltzmann entropy is robust to reasonable choices of coarse-graining, apparent complexity is robust to reasonable choices of smoothing. You would get a strange answer only if you chose an unphysical f, like averaging over random non-contiguous cells.
The paper also connects this to light-cone complexity. The coarse-graining regions (g × g squares of contiguous cells) correspond roughly to the causal structure of the automaton. A particle at position (x, y) can only influence cells within a diamond-shaped region around it. By coarse-graining over local squares, we are effectively looking at the information that survives within a causal neighborhood — the same kind of information that light-cone complexity examines.
Let us go deeper into the three alternatives the paper considers but ultimately does not use for experiments.
Sophistication tries to separate the "non-random" part of a description from the "random" part. Given x, find a model S (a set containing x) such that:
and x is a "generic" element of S (knowing S leaves K(x|S) ≥ log|S| − c bits of uncertainty). The sophistication is the minimum K(S) over all such models.
The critical problem: for any short probabilistic program P (like the coffee automaton), the output x has low sophistication with overwhelming probability. Why? Take S = {all outputs y of P with Pr[y] ≈ Pr[x]}. This set S contains x, it takes only O(log n) bits to describe (the program P plus a probability threshold), and x is a generic element of S (since the probability of x within S is roughly uniform). So sophistication is at most O(log n) — never large — for the coffee automaton.
This is a deep limitation. It means sophistication, despite being the most theoretically principled complexity measure, is useless for studying physical systems that evolve by short, simple rules. The coffee automaton has simple rules and a simple initial state. Sophistication sees nothing interesting in its output. But our eyes see swirling cream tendrils. We need a measure that captures what we see.
Logical depth measures the running time of the shortest near-minimal program for x. As we discussed in Chapter 2, simple strings have short, fast programs, while random strings have long but trivially fast programs. "Deep" strings require short programs that take a very long time.
For the coffee automaton specifically, the objection is sharp: the automaton's transition rule is simple and fast. Generating the state at time t requires only t random choices. The state might look complex to our eyes, but its logical depth is modest. Running time is not measuring what we want.
Light-cone complexity (Shalizi et al.) measures how much a point's past predicts its future:
When dynamics are too simple, both H terms are small. When dynamics are too random, the past and future are uncorrelated. Only in the intermediate regime is LCC large. But it has a flaw: in slowly-evolving systems after equilibrium, LCC stays large (the past light cone contains almost the same random data as the future one), giving a false positive.
Watch cream mix into coffee and see entropy and complexity diverge in real time. The left panel shows the 40 × 40 grid: light cells are cream, dark cells are coffee. The right panel plots two curves as the simulation runs. The teal curve is entropy (a proxy based on the number of unlike-neighbor transitions in the fine-grained grid). The orange curve is complexity (the same transition count applied to the coarse-grained grid).
Click Start to begin. Try the interacting model first: watch the boundary between cream and coffee develop jagged tendrils. Then reset and switch to non-interacting: the boundary stays smooth as particles diffuse independently. The difference in the complexity curves is striking.
Toggle between interacting (particles swap — exclusion enforced) and non-interacting (particles walk independently — multiple per cell allowed). Use the speed slider to control how many swaps happen per animation frame.
The paper runs the coffee automaton at multiple grid sizes and measures how the key quantities scale with n (the side length of the grid).
| Quantity | Scaling with n | Why |
|---|---|---|
| Maximum entropy | O(n2) | Proportional to the total number of particles. A fully random n × n grid needs n2 bits. |
| Maximum complexity | O(n) | Complexity develops along the 1D boundary between cream and coffee. The boundary's horizontal extent is proportional to n. |
| Time to peak complexity | O(n2) | Proportional to the number of particles — all particles need to participate in the mixing. |
The quadratic scaling of both max entropy (O(n2)) and time to peak complexity (O(n2)) makes physical sense. Max entropy is proportional to the number of particles, which grows as the area. The time to peak complexity is proportional to how long it takes n2 particles to rearrange themselves into the maximally complex configuration — each swap moves one pair, so O(n2) swaps are needed.
These results are robust across compression algorithms. The paper tested gzip, bzip2, and lzma on the same simulation data. All three produce qualitatively identical curves: entropy increases monotonically, complexity rises and falls. The specific file sizes differ (lzma compresses tighter than gzip), but the shape is the same. This robustness is important: it suggests that the phenomenon is real, not an artifact of a particular compression algorithm's biases.
| Compressor | Entropy trend | Complexity trend | Peak matches? |
|---|---|---|---|
| gzip | Monotonically increasing | Rise then fall | Yes |
| bzip2 | Monotonically increasing | Rise then fall | Yes |
| lzma | Monotonically increasing | Rise then fall | Yes |
The non-interacting case — a theorem. For the non-interacting automaton with periodic boundary conditions, the paper proves that the Kolmogorov complexity of the coarse-grained state is O(log n) at all times, provided the grain size L satisfies L > Θ(G √(log n)). The proof sketch goes like this:
Let at(B) be the number of cream particles in an L × L block B at time t. Since particles do independent random walks, at(B) is a sum of independent 0/1 random variables. By a Chernoff bound:
If L is large enough (specifically L > G √(3 ln(2n2))), then the probability that any block deviates from its expectation is less than 1/n2 by a union bound. And since the expectation depends only on the vertical position (by horizontal symmetry), the coarse-grained state is determined by at most n values, each of which is predictable from n and t alone. Total Kolmogorov complexity: O(log n).
This means the adjusted coarse-graining correctly produces flat complexity for the non-interacting model. The small bump seen in the basic coarse-graining was a thresholding artifact, not genuine complexity.
The adjusted coarse-graining. The paper refines the basic method by using seven thresholds (instead of three) and a majority-vote algorithm. For each row, if a cell's threshold is within one step of the row majority, it gets flattened to the majority. This kills the border artifacts — rows that fluctuate between two nearby thresholds get smoothed to a single value. The result: the non-interacting model's complexity drops to near zero, while the interacting model's complexity curve is preserved.
Why might the interacting model achieve O(n) complexity? The authors offer an intuitive argument: the automaton begins with a horizontal boundary between cream and coffee. As mixing proceeds, this boundary becomes increasingly jagged. The boundary has horizontal extent n. The complexity of describing a random jagged boundary of length n should be Θ(n). After sufficient mixing, the boundary dissolves and the coarse-grained state becomes uniform again. So the complexity rises to O(n) and then falls. But formalizing "random jagged boundary" into a rigorous proof remains elusive.
The Minimum Description Length principle. Coarse-graining the coffee automaton is closely related to two-part codes in MDL: the model S (coarse-grained state) plus the data given the model (fine-grained residual). The sophistication K(S) is the MDL model cost. This connects the paper to practical machine learning, where MDL is used for model selection. In MDL, you want a model that is neither too simple (underfitting) nor too complex (overfitting). The optimal model has intermediate complexity. The coffee automaton at its peak complexity is, in a sense, the physical analog of this optimal model — the state that is hardest to compress at the macro scale.
Renormalization group. Coarse-graining in physics is formalized by the renormalization group: integrating out small-scale degrees of freedom to study large-scale behavior. The paper's smoothing function f is a simple form of real-space renormalization. When you average over a g × g neighborhood, you are integrating out fluctuations at scales smaller than g. The "complexity" at each scale is determined by the structure that survives this integration.
Cellular automata and emergent computation. The coffee automaton joins a lineage that includes Conway's Game of Life and Wolfram's elementary automata. But the focus here is different. Wolfram studied whether simple rules could produce complex dynamics. Aaronson, Carroll, and Ouellette study whether those dynamics produce complex states at intermediate times — and whether the complexity eventually disappears. The answer is yes, at least for the interacting model.
Biological complexity. Living organisms are the most complex structures in the known universe, and they exist precisely in the thermodynamic middle ground: far from equilibrium, sustained by energy flows. The coffee automaton does not model life, but it captures the same principle in miniature. Complexity requires a system that has left its initial ordered state but has not yet reached equilibrium. Life is the universe's version of the swirling cream tendrils.
Connection to deep learning. The idea that complexity lives between order and randomness appears in neural network theory as the "edge of chaos" hypothesis: networks trained at the boundary between ordered and chaotic dynamics learn the most useful representations. The coffee automaton gives a clean, information-theoretic version of this intuition.
Why Ilya Sutskever included this paper. This paper is one of Sutskever's "30 under 30" recommended readings. It connects computational complexity, statistical mechanics, and information theory in a single, concrete model. For someone thinking about intelligence and complexity, the core message is powerful: interesting structure is transient. It arises between order and randomness, and it eventually disappears. Understanding when and why complexity peaks may be essential to understanding both the physical universe and the learned representations inside neural networks.
Authors. Scott Aaronson is a theoretical computer scientist known for quantum computing complexity theory. Sean M. Carroll is a theoretical physicist known for work on cosmology, entropy, and the arrow of time. Lauren Ouellette was a student at MIT when the work was done. The paper reflects the unique combination of computational complexity thinking (Aaronson), physical intuition about entropy and the arrow of time (Carroll), and careful implementation (Ouellette).
What remains open. The big unsolved problem: prove that the interacting coffee automaton actually achieves complexity Ω(n) at intermediate times. The paper only has numerical evidence and an upper bound. A lower bound would require showing that no compression algorithm can describe the coarse-grained state in fewer than Ω(n) bits — a statement about all possible programs, which is fundamentally harder than exhibiting a specific compression. A second open direction: extend the framework to systems with conservation laws, phase transitions, or three spatial dimensions.
Paper details. "Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton," Scott Aaronson, Sean M. Carroll, Lauren Ouellette. arXiv:1405.6903, 2014. Source code for the original automaton is available at scottaaronson.com.
Key references from the paper. Bennett (1995) on logical depth. Gacs, Tromp, Vitanyi (2001) on algorithmic statistics. Shalizi, Shalizi, Haslinger (2004) on light-cone complexity. Antunes and Fortnow (2009) on the equivalence of coarse sophistication and Busy Beaver depth. Carroll (2010), From Eternity to Here, for the cosmological context.