A manifesto arguing that deep learning is on the verge of a Newtonian moment — where empirical regularities crystallize into a genuine scientific theory, organized around five independent lines of evidence.
Deep learning works. This is not in dispute. GPT-4 writes code, AlphaFold predicts protein structures, diffusion models generate photorealistic images. The collective engineering achievement is staggering. But here is the strange part: we built all of this before we understood why it works. We are, in the words of this paper, like the 18th-century engineers who built efficient steam engines before thermodynamics existed to explain them.
James Watt improved the steam engine empirically: try this, measure that, iterate. His engines worked. But he could not tell you why a hotter steam source produced more work per unit fuel, or what the fundamental limit on efficiency was. That understanding came roughly half a century later, when Carnot (1824) showed that the maximum efficiency of a heat engine depends only on the temperatures of its hot and cold reservoirs; Clausius later recast the result in terms of entropy. And once the theory existed, it didn't just explain what we already knew. It predicted new things, guided new designs, and became the backbone of all subsequent thermodynamic engineering.
The authors argue deep learning is at the Watt moment. We have the engines. We are missing the Carnot. Practitioners rely on lore — "use Adam, not SGD," "batch norm helps," "deeper is better with residuals" — but these rules are folklore without derivation. They work until they don't, and we can't always tell in advance which situation we're in. The paper's thesis: a genuine scientific theory of deep learning — what they call Learning Mechanics — is now within reach.
Theory requires measurement. The good news is that deep learning is uniquely well-instrumented. Every training run is a controlled experiment: fixed architecture, fixed dataset, fixed optimizer, deterministic (up to seeds) dynamics. Unlike biological neural circuits or social systems, we can replay the experiment exactly. We can intervene — change one hyperparameter, rerun — and measure the causal effect. This makes deep learning amenable to the scientific method in a way that many complex systems are not.
The bad news is scale. A modern LLM can have 70 billion parameters or more, so its loss landscape lives in a space with tens of billions of dimensions. Its training trajectory is a path through that space over millions of steps. Describing this system fully is hopeless. But we don't need a full description; we need the right description. Thermodynamics didn't track every molecule; it found the macroscopic variables (temperature, pressure, entropy) that captured the relevant physics. Learning Mechanics needs to find the analogous macrostates for neural networks.
Hover over each era to see the parallel between thermodynamics history and the emerging theory of deep learning.
Before building a theory, we need to agree on what a theory should do. The authors propose seven desiderata — criteria that Learning Mechanics must satisfy to be genuinely useful. These are not arbitrary; they're distilled from the history of successful scientific theories in physics, and each one addresses a specific failure mode of existing attempts to theorize deep learning.
The first three are epistemic: the theory must be fundamental (deriving behavior from basic principles, not just renaming observations), mathematical (with equations that make quantitative predictions, not just qualitative stories), and predictive (making correct statements about systems not yet observed, not just fitting known data). These are the core requirements of any scientific theory. Existing DL theory often fails here — "the model learns features" is fundamental-sounding but unmathematical; "generalization gap is bounded by complexity" is mathematical but rarely predictive in practice.
The next three are about scope and style: comprehensive (covering the whole training pipeline — architecture, optimizer, data, generalization — not just one slice), intuitive (building mechanistic understanding, not just curve-fitting), and useful (guiding real practitioner decisions, like which architecture to use or how to set the learning rate). The final desideratum is perhaps the most important and most often violated: humble. The theory should know what it doesn't cover, explicitly declaring its assumptions and domain of validity. Overreaching theories — claiming to explain all of deep learning when they only apply to two-layer linear nets — are worse than no theory at all, because they mislead.
PAC (Probably Approximately Correct) learning is the classical statistical learning theory. It gives bounds: "with high probability, a model trained on N examples will generalize." These bounds are mathematically rigorous and general. But they spectacularly fail the predictive desideratum for modern deep learning. A ResNet-50 has about 25 million parameters and is trained on 1.2 million ImageNet examples, roughly twenty parameters per training example. In that massively overparameterized regime, classical bounds based on parameter counts or VC dimension are vacuous: they cannot rule out test error as bad as random guessing. Yet the network generalizes beautifully. PAC theory describes a worst-case universe; real deep learning lives in a best-case slice of that universe that the theory doesn't characterize.
Learning Mechanics aims to be a mechanistic theory of that best-case slice — explaining why SGD with weight decay finds generalizing solutions, not just proving that such solutions exist in principle. The distinction mirrors physics vs. mathematics: a mathematician can prove solutions to the Navier-Stokes equations exist; a physicist wants to compute what the turbulence actually looks like.
Each row shows a physical concept and its Learning Mechanics analog. Click a row to see the mathematical connection.
A theory is only as good as its solvable cases. Classical mechanics began with simple pendulums and point masses — idealized settings where the math closes. From those, it built intuition for the general case. Learning Mechanics follows the same strategy: find simplified models of neural networks where training dynamics can be computed exactly, then use those exact solutions to build intuition about realistic networks.
The most productive solvable setting is the deep linear network — a neural network with multiple layers but no nonlinearities. It sounds trivially simple (it's just a matrix product), but its training dynamics under gradient flow are surprisingly rich. Saxe, McClelland, and Ganguli (2014) showed that gradient flow on a deep linear network performs an implicit form of greedy low-rank matrix factorization: the singular values of the weight product matrix are learned one at a time, largest first. Specifically, if the target matrix has singular values σ₁ ≥ σ₂ ≥ ... ≥ σₙ, then at time t during training, the learned matrix's singular values follow a sigmoid-like trajectory, with σ₁ switching on first, then σ₂, and so on.
Why does this matter? Because it shows that even without nonlinearities, gradient descent has nontrivial inductive biases: it doesn't learn all components of the target simultaneously. It prioritizes structure. When each singular value switches on depends on its magnitude and on the network's depth and initialization scale: depth changes how quickly and how sharply each transition happens, but the sequential order is preserved. This is a clean, mathematical statement about learning dynamics that the theory can then test against nonlinear networks.
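The sequential switch-on can be reproduced in a few lines. The following is a minimal sketch, assuming a two-layer linear network, a small random initialization, plain gradient descent on a squared-error loss, and a diagonal target; the function names `train_two_layer_linear` and `switch_on_steps` are made up for this illustration and do not come from Saxe et al.

```python
import numpy as np

def train_two_layer_linear(target, init_scale=1e-3, lr=0.01, steps=4000, seed=0):
    """Gradient descent on 0.5 * ||W2 @ W1 - target||_F^2.

    Returns an array of shape (steps, k) holding the singular values of the
    product W2 @ W1 (sorted largest first) before each update.
    """
    rng = np.random.default_rng(seed)
    d_out, d_in = target.shape
    W1 = init_scale * rng.standard_normal((d_in, d_in))
    W2 = init_scale * rng.standard_normal((d_out, d_in))
    history = np.empty((steps, min(d_out, d_in)))
    for t in range(steps):
        prod = W2 @ W1
        history[t] = np.linalg.svd(prod, compute_uv=False)
        err = prod - target
        grad_W1 = W2.T @ err      # d loss / d W1
        grad_W2 = err @ W1.T      # d loss / d W2
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return history

def switch_on_steps(history, target_svs):
    """First step at which the k-th learned singular value reaches half its target."""
    return [int(np.argmax(history[:, k] >= 0.5 * s)) for k, s in enumerate(target_svs)]

if __name__ == "__main__":
    svs = [3.0, 1.5, 0.5]
    hist = train_two_layer_linear(np.diag(svs))
    print("switch-on steps (largest singular value first):", switch_on_steps(hist, svs))
```

With these settings the largest singular value should cross half its target first and the smallest last. Note that `np.argmax` returns 0 if a mode never reaches the threshold, so a longer run may be needed for very small singular values or smaller initializations.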
The second key solvable setting is the Neural Tangent Kernel (NTK) regime, introduced by Jacot, Gabriel, and Hongler (2018). The idea: a neural network f(x; θ) can be linearized around its initial parameters θ₀. The linear approximation is f_lin(x; θ) = f(x; θ₀) + ∇_θf(x; θ₀)ᵀ(θ - θ₀). In this linear model, gradient descent becomes exactly solvable — the dynamics are a linear ODE, and the solution can be written in closed form using the kernel K(x, x') = ∇_θf(x; θ₀)ᵀ∇_θf(x'; θ₀), the NTK.
In very wide networks (infinite width limit), the NTK doesn't change during training — the weights move so little relative to the scale of the random initialization that the Jacobian ∇_θf stays approximately constant. This is the lazy training regime. In this regime, the network behaves like a kernel method, and its generalization can be analyzed using kernel theory. The NTK gave a mathematically complete theory of infinitely wide networks — predictive, rigorous, and exact. The problem: real networks are not infinitely wide, and in the NTK regime they don't learn useful features. The NTK is a solvable idealization, not a description of practice.
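To make the kernel concrete, here is a minimal sketch that computes the empirical NTK of a small two-layer tanh network with a hand-written Jacobian, then evaluates the closed-form solution of the linearized dynamics on the training points. The architecture, the 1/√m output scaling, the loss 0.5·||f − y||² and all function names are assumptions chosen for illustration, not details from Jacot et al.

```python
import numpy as np

def init_params(d, m, seed=0):
    rng = np.random.default_rng(seed)
    return rng.standard_normal((m, d)), rng.standard_normal(m)

def forward(X, W, a):
    """f(x) = a . tanh(W x) / sqrt(m), evaluated for every row of X."""
    return np.tanh(X @ W.T) @ a / np.sqrt(W.shape[0])

def jacobian(X, W, a):
    """One row per example; columns are d f / d W (flattened), then d f / d a."""
    m = W.shape[0]
    H = np.tanh(X @ W.T)                                   # (n, m)
    dH = 1.0 - H**2                                        # tanh'
    J_W = (a * dH)[:, :, None] * X[:, None, :] / np.sqrt(m)  # (n, m, d)
    J_a = H / np.sqrt(m)                                   # (n, m)
    return np.concatenate([J_W.reshape(len(X), -1), J_a], axis=1)

def empirical_ntk(X1, X2, W, a):
    """K(x, x') = grad f(x) . grad f(x') at the given parameters."""
    return jacobian(X1, W, a) @ jacobian(X2, W, a).T

def linearized_predictions(K, f0, y, t):
    """Gradient flow on 0.5 * ||f - y||^2 for the linearized model:
    df/dt = -K (f - y)  =>  f(t) = y + exp(-K t) (f0 - y)."""
    evals, evecs = np.linalg.eigh(K)
    return y + evecs @ (np.exp(-evals * t) * (evecs.T @ (f0 - y)))

if __name__ == "__main__":
    X = np.random.default_rng(1).standard_normal((5, 3))
    W, a = init_params(d=3, m=256)
    K = empirical_ntk(X, X, W, a)
    y = np.ones(5)
    print(linearized_predictions(K, forward(X, W, a), y, t=5.0))
```

Because K is a Gram matrix of gradients, it is symmetric and positive semidefinite, which is what makes the exponential decay toward the labels well defined.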
Watch the singular values of the learned weight product switch on one by one, largest first. Adjust depth to see how it speeds up convergence.
When the exact analysis of deep linear networks or NTK becomes intractable for realistic architectures, Learning Mechanics turns to simplifying limits — extreme parameter regimes where the theory simplifies dramatically. The most important such limit is the lazy vs. rich dichotomy, controlled by a single parameter α that governs the scale of network outputs at initialization.
Write the network as f(x; θ) = α · g(x; θ), where g is a standard parameterized network and α scales the output. When α is large, a tiny change in the parameters produces a large change in the output, so the network can fit the training data while its weights barely move from initialization. Because the weights barely move, the linearization around θ₀ stays accurate throughout training: the network is in the lazy training (NTK) regime. The NTK stays approximately constant, features don't adapt, and the model behaves like a kernel method. (Analyses of this regime usually also keep the output at initialization small, for example by subtracting it, so that a large α does not simply blow up the initial loss.) When α is small, the opposite happens: the weights must travel far to change the output, so the network is in the rich or mean-field regime, where the NTK evolves during training, features adapt, and the network learns genuinely new representations rather than just re-weighting fixed features.
This dichotomy matters enormously in practice because the two regimes generalize differently. Lazy networks generalize like kernel methods — well when the target function is in the RKHS of the NTK, poorly otherwise. Rich networks can learn features aligned with the data structure, potentially achieving much better generalization on low-dimensional tasks. The transition between lazy and rich is sharp (like a phase transition), and the location of the transition depends on α, width, and learning rate in predictable ways that the theory can compute.
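One signature of the lazy regime is easy to check numerically: how far the weights travel from initialization. The sketch below assumes a common setup from the lazy-training literature that is not spelled out in the text: the model is α·(g(x; θ) − g(x; θ₀)), so the output starts at zero, and the squared loss is divided by α² so that the function-space dynamics are comparable across α. The network, data and the name `train_scaled` are illustrative choices.

```python
import numpy as np

def train_scaled(alpha, steps=500, lr=0.2, width=200, seed=0):
    """Train alpha * (g(x; theta) - g(x; theta_0)) on a 1D regression task.

    Loss (assumed): mean((out - y)^2) / (2 * alpha^2).
    Returns (relative weight movement, initial MSE, final MSE).
    """
    rng = np.random.default_rng(seed)
    X = np.linspace(-1.0, 1.0, 16)[:, None]
    y = np.sin(3.0 * X[:, 0])
    n = len(y)
    W0 = rng.standard_normal((width, 1))
    b0 = rng.standard_normal(width)
    a0 = rng.standard_normal(width)
    W, b, a = W0.copy(), b0.copy(), a0.copy()
    g0 = np.tanh(X @ W0.T + b0) @ a0 / np.sqrt(width)   # output at initialization

    def output(W, b, a):
        return alpha * (np.tanh(X @ W.T + b) @ a / np.sqrt(width) - g0)

    initial_mse = float(np.mean((output(W, b, a) - y) ** 2))
    for _ in range(steps):
        H = np.tanh(X @ W.T + b)                         # hidden features, (n, width)
        out = alpha * (H @ a / np.sqrt(width) - g0)
        coef = (out - y) / (n * alpha)                   # (d loss / d out) * alpha
        dpre = coef[:, None] * (1.0 - H**2) * a / np.sqrt(width)
        grad_a = H.T @ coef / np.sqrt(width)
        grad_b = dpre.sum(axis=0)
        grad_W = dpre.T @ X
        a -= lr * grad_a
        b -= lr * grad_b
        W -= lr * grad_W
    final_mse = float(np.mean((output(W, b, a) - y) ** 2))
    moved = np.sqrt(np.sum((W - W0) ** 2) + np.sum((b - b0) ** 2) + np.sum((a - a0) ** 2))
    norm0 = np.sqrt(np.sum(W0 ** 2) + np.sum(b0 ** 2) + np.sum(a0 ** 2))
    return float(moved / norm0), initial_mse, final_mse

if __name__ == "__main__":
    for alpha in (1.0, 100.0):
        print(alpha, train_scaled(alpha))
```

Under these assumptions both runs should reduce the training error by a similar amount, while the relative weight movement shrinks roughly in proportion to 1/α. A width-200 network at α = 1 is not deep in the rich regime, so this shows the trend rather than the full dichotomy.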
One of the more striking claims in the paper is the Discretization Hypothesis: that SGD (stochastic gradient descent, which updates on minibatches) is not just a noisy approximation of full-batch gradient descent, but qualitatively different in a way that promotes generalization. The hypothesis is that SGD noise acts like a symmetry-breaking field that biases the optimizer toward solutions with discrete structure — weight matrices with low effective rank, representations that cluster into discrete prototypes (neural collapse), circuits that implement clean Boolean functions.
The evidence is circumstantial but striking: models trained with larger batch sizes (less SGD noise) often generalize worse, even when the learning rate is adjusted to compensate. The discrete structures that emerge in trained networks (such as the Fourier-basis modular arithmetic circuits found in mechanistic interpretability) look like they arose from a discretizing pressure, not from gradient descent alone. The hypothesis provides a unified story for why neural collapse, grokking, and low-rank structure all emerge from the same training procedure.
Two networks fit a 1D target function. Adjust α to move between lazy (NTK, fixed features) and rich (adaptive features) regimes and see how the learned function changes.
The strongest evidence that a scientific theory of deep learning is forming is the discovery of quantitative empirical laws — equations with fitted constants that hold across orders of magnitude of variation. Thermodynamics didn't begin with derivations from statistical mechanics; it began with Boyle's Law (PV = constant at fixed T) and Charles's Law (V ∝ T at fixed P). These were measured regularities before they were explained. Learning Mechanics has found its own version of these laws, and they are surprisingly tight.
The most famous is the neural scaling law, first measured systematically by Kaplan et al. (2020). Train autoregressive language models of varying sizes on varying amounts of data, and measure the test loss. The result: L(N) ≈ α · N^{-β} + L∞, where N is the number of model parameters, α and β are fitted constants (β ≈ 0.07 for language; this α is a fitted coefficient, unrelated to the output scale α of the lazy/rich discussion), and L∞ is the irreducible entropy of the data. This power law holds over five orders of magnitude in N, from 10⁵ to 10¹⁰ parameters. Analogous power laws hold for compute, data, and parameters separately. They hold across architectures and datasets. The law is so consistent that practitioners now use it for extrapolation: fit the law on small models, predict the loss at GPT-4 scale.
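The "fit on small models, extrapolate" workflow can be sketched as follows. This is a minimal NumPy-only fit under stated assumptions: the irreducible loss L∞ is found by a simple grid search, and for each candidate the remaining two constants come from a straight-line fit in log-log space. The function names and the synthetic data in the usage example are invented for illustration and are not measurements from any paper.

```python
import numpy as np

def fit_scaling_law(n_params, losses, grid_size=2000):
    """Fit L(N) = alpha * N**(-beta) + l_inf.

    Grid-searches l_inf in [0, min(losses)), and for each candidate fits
    log(L - l_inf) = log(alpha) - beta * log(N) by least squares.
    Returns (alpha, beta, l_inf) for the candidate with the smallest residual.
    """
    n_params = np.asarray(n_params, dtype=float)
    losses = np.asarray(losses, dtype=float)
    log_n = np.log(n_params)
    best = None
    for l_inf in np.linspace(0.0, losses.min() * 0.999, grid_size):
        log_excess = np.log(losses - l_inf)
        slope, intercept = np.polyfit(log_n, log_excess, 1)
        resid = np.sum((slope * log_n + intercept - log_excess) ** 2)
        if best is None or resid < best[0]:
            best = (resid, float(np.exp(intercept)), float(-slope), float(l_inf))
    return best[1], best[2], best[3]

def predict_loss(n_params, alpha, beta, l_inf):
    return alpha * np.asarray(n_params, dtype=float) ** (-beta) + l_inf

if __name__ == "__main__":
    # Synthetic "small model" losses generated from a known law (illustrative values).
    sizes = np.logspace(5, 10, 11)
    observed = 10.0 * sizes ** (-0.07) + 1.7
    alpha, beta, l_inf = fit_scaling_law(sizes, observed)
    print("fitted:", alpha, beta, l_inf)
    print("extrapolated loss at 1e11 parameters:", predict_loss(1e11, alpha, beta, l_inf))
```

With real, noisy measurements the estimate of L∞ is the fragile part, because a small exponent makes the curve nearly flat over the observed range; a nonlinear least-squares fit with uncertainty estimates would be a more careful choice than this grid search.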
The second major empirical law is the Edge of Stability phenomenon, discovered by Cohen et al. (2021). Train a network with full-batch gradient descent at learning rate η. The sharpness of the loss landscape (largest eigenvalue of the Hessian, λ_max) does not converge to a small value. Instead, it climbs during training until it hovers near 2/η. This is striking because 2/η is the exact threshold at which gradient descent becomes unstable for a quadratic loss (unstable when λ_max · η > 2). The network sits perpetually at this edge, neither diverging nor reducing sharpness further. This is a quantitative prediction: double the learning rate, and the equilibrium sharpness halves.
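The 2/η threshold itself comes from the simplest possible case, a one-dimensional quadratic, and can be checked directly. The sketch below only illustrates that stability condition; it does not reproduce the progressive sharpening seen in real networks, and the function names are chosen for this example.

```python
def gd_on_quadratic(sharpness, lr, steps=100, theta0=1.0):
    """Gradient descent on L(theta) = 0.5 * sharpness * theta**2.

    Each step multiplies theta by (1 - lr * sharpness), so the iterates
    shrink when lr * sharpness < 2 and blow up when lr * sharpness > 2.
    """
    theta = theta0
    for _ in range(steps):
        theta -= lr * sharpness * theta
    return theta

def stability_threshold(lr):
    """Largest sharpness that gradient descent at step size lr can tolerate."""
    return 2.0 / lr

if __name__ == "__main__":
    sharpness = 10.0
    print("threshold for lr=0.2:", stability_threshold(0.2))
    print("lr*sharpness = 1.9:", gd_on_quadratic(sharpness, lr=0.19))
    print("lr*sharpness = 2.1:", gd_on_quadratic(sharpness, lr=0.21))
```

Just below the threshold the iterate oscillates in sign while decaying; just above it, the same oscillation grows without bound.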
A third law, neural collapse, discovered by Papyan, Han, and Donoho (2020), describes the geometry of representations in the final layer of a classifier at the end of training. As training continues past the point of zero training error (terminal phase), the within-class variability of last-layer representations collapses to zero, the class means form a maximally spread simplex equiangular tight frame (ETF), and the classifier weights align with the class means. This is a precise geometric prediction about a quantity that, naively, could be anything. It holds across datasets, architectures, and loss functions. Neural collapse is now used as a diagnostic for training quality and as a design principle for loss functions.
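The simplex equiangular tight frame has a short explicit construction, which makes the geometric claim checkable. The sketch below builds the standard K-class simplex ETF in K dimensions and verifies its defining properties; the function name is chosen for this example.

```python
import numpy as np

def simplex_etf(num_classes):
    """Columns are K unit vectors in R^K with pairwise cosine -1/(K-1).

    Construction: sqrt(K/(K-1)) * (I - (1/K) * ones), i.e. the standard
    basis with its mean removed, rescaled to unit norm.
    """
    K = num_classes
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

if __name__ == "__main__":
    K = 10
    M = simplex_etf(K)
    gram = M.T @ M
    off_diag = gram[~np.eye(K, dtype=bool)]
    print("column norms:", np.linalg.norm(M, axis=0))
    print("pairwise cosine:", off_diag[0], "expected:", -1.0 / (K - 1))
```

A practical neural-collapse diagnostic would compare the Gram matrix of centered, normalized class means from a trained network against this ideal Gram matrix.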
Left: neural scaling law on log-log axes — drag to explore the power-law fit. Right: edge of stability — adjust learning rate η to see how equilibrium sharpness tracks 2/η.
One of the most practically consequential areas where Learning Mechanics is making progress is hyperparameter transfer — the problem of setting learning rates and other optimizer parameters for large models. The current practice is brutal: tune on a small proxy model, hope the optimal hyperparameters transfer to the large model, and burn enormous compute when they don't. The Maximal Update Parameterization (µP), developed by Greg Yang and colleagues (2022), provides a theoretical framework that makes transfer provably correct under certain assumptions.
The core insight of µP is that standard parameterizations scale hyperparameters badly with width. In a standard MLP with hidden width W, the optimal learning rate shrinks roughly as 1/W: as you widen the network, you must shrink the learning rate proportionally. This means hyperparameters tuned at width 128 are badly wrong at width 4096. µP fixes this by reparameterizing the network so that the optimal learning rate is width-independent. Concretely, where the standard parameterization scales the output (readout) layer as 1/√W, µP scales it as 1/W and adjusts the per-layer initialization and learning rates to compensate. The result: the effective update magnitude to each neuron stays constant as width grows, and the optimal learning rate η* stays approximately constant across widths.
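Part of the intuition for a 1/W rather than 1/√W output scaling can be shown with a toy calculation. This is a deliberately simplified sketch and not the µP derivation: it assumes hidden activations with O(1) entries and compares a random readout vector (as at initialization) with a readout vector fully aligned with the activations (an exaggerated stand-in for what training does). The function name and the `power` parameter are made up for the example.

```python
import numpy as np

def readout_rms(width, power, aligned, trials=200, seed=0):
    """RMS of (w . h) / width**power over random draws of activations h.

    aligned=False: w is independent of h, so w . h grows like sqrt(width).
    aligned=True:  w equals h, so w . h grows like width.
    """
    rng = np.random.default_rng(seed)
    outputs = np.empty(trials)
    for t in range(trials):
        h = rng.standard_normal(width)
        w = h if aligned else rng.standard_normal(width)
        outputs[t] = (w @ h) / width**power
    return float(np.sqrt(np.mean(outputs**2)))

if __name__ == "__main__":
    for width in (256, 4096):
        print(
            width,
            "random, 1/sqrt(W):", readout_rms(width, 0.5, aligned=False),
            "aligned, 1/sqrt(W):", readout_rms(width, 0.5, aligned=True),
            "aligned, 1/W:", readout_rms(width, 1.0, aligned=True),
        )
```

In this toy setting the 1/√W scaling keeps the output O(1) only while the readout is uncorrelated with the activations; once they align, the output grows like √W, whereas the 1/W scaling keeps it O(1) at every width.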
The practical implication is striking. The linear scaling rule for learning rate (popular in distributed training) says: if you increase batch size by k, multiply learning rate by k. This works empirically for moderate batch sizes. µP provides the theoretical explanation: both the linear scaling rule and µP's width transfer rule come from the same underlying principle — keeping the ratio of gradient signal to parameter scale constant. µP extends this principle to architecture changes, not just batch size changes.
Beyond µP, Learning Mechanics has identified a subtler hyperparameter effect: implicit curvature regularization. When you train with gradient descent at finite step size η (as opposed to infinitesimal gradient flow), the iterates track, to leading order in η, gradient flow on the modified loss L(θ) + (η/4) · ||∇L(θ)||² rather than on L(θ) itself. The extra term penalizes gradient norm, so it acts as a regularizer that biases the solution toward flat minima. This is not a designed regularizer; it emerges from the discretization of continuous gradient flow by finite step size. The effect is: larger η → stronger flatness regularization → better generalization (up to a point). This is consistent with empirically observed phenomena such as "warm-up" learning rate schedules improving generalization even when the final learning rate is the same.
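The modified-loss claim can be checked exactly in one dimension. The sketch below assumes a quadratic loss L(θ) = ½λθ², for which one gradient descent step, gradient flow on L, and gradient flow on L + (η/4)·||∇L||² all have closed forms; the function name is chosen for this example.

```python
import numpy as np

def one_step_comparison(curvature=1.0, lr=0.1, theta0=1.0):
    """Compare one gradient descent step with two continuous-time flows run for time lr.

    For L = 0.5 * curvature * theta**2:
      gradient descent:       theta0 * (1 - lr * curvature)
      flow on L:              theta0 * exp(-lr * curvature)
      flow on modified loss:  theta0 * exp(-lr * (curvature + lr * curvature**2 / 2))
    The modified loss is L + (lr / 4) * (curvature * theta)**2, whose gradient is
    (curvature + lr * curvature**2 / 2) * theta.
    """
    gd = theta0 * (1.0 - lr * curvature)
    flow = theta0 * np.exp(-lr * curvature)
    modified_flow = theta0 * np.exp(-lr * (curvature + lr * curvature**2 / 2.0))
    return gd, float(flow), float(modified_flow)

if __name__ == "__main__":
    gd, flow, modified = one_step_comparison()
    print("gap to plain flow:   ", abs(flow - gd))
    print("gap to modified flow:", abs(modified - gd))
```

The modified flow should land much closer to the discrete step than the plain flow does, which is the sense in which finite-step gradient descent "implicitly" follows the regularized loss.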
The blue line (standard parameterization) shows how optimal LR drops as width grows. The orange line (µP) stays flat — tune once, transfer anywhere.
One of the most striking empirical discoveries supporting Learning Mechanics is universality: different neural networks trained independently on different datasets converge to strikingly similar representations. Train a ResNet on ImageNet and a ViT on LAION-5B. Take any image, compute its representation in both networks. The representations are not identical, but they are linearly related — there exists a matrix M such that, for most images, ResNet_repr ≈ M · ViT_repr. This linear alignment is far too strong to be coincidence. It suggests that these architectures, despite different sizes and training procedures, are discovering the same underlying structure in the visual world.
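Whether two sets of representations are "linearly related" can be tested with ordinary least squares on held-out examples. The sketch below is one simple way to do that, assuming representations are stored as row vectors (so the fitted matrix is the transpose of the M in the text) and have roughly zero mean; the function name and the synthetic data are illustrative, and this is not the similarity metric used in any particular paper.

```python
import numpy as np

def heldout_linear_r2(reps_a, reps_b, train_frac=0.5):
    """Fit reps_a ~ reps_b @ M on the first part of the data, score on the rest.

    reps_a: (n, d_a) representations from model A, one row per input.
    reps_b: (n, d_b) representations from model B for the same inputs.
    Returns held-out R^2: close to 1 for a good linear map, near or below 0 for none.
    """
    n_train = int(len(reps_a) * train_frac)
    M, *_ = np.linalg.lstsq(reps_b[:n_train], reps_a[:n_train], rcond=None)
    resid = reps_a[n_train:] - reps_b[n_train:] @ M
    centered = reps_a[n_train:] - reps_a[:n_train].mean(axis=0)
    return float(1.0 - np.sum(resid**2) / np.sum(centered**2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((400, 10))          # shared structure (synthetic)
    reps_a = latent @ rng.standard_normal((10, 20)) + 0.05 * rng.standard_normal((400, 20))
    reps_b = latent @ rng.standard_normal((10, 30)) + 0.05 * rng.standard_normal((400, 30))
    print("shared latent:", heldout_linear_r2(reps_a, reps_b))
    print("unrelated:    ", heldout_linear_r2(rng.standard_normal((400, 20)), reps_b))
```

Scoring on held-out inputs matters: with enough dimensions, a least-squares map can fit unrelated representations on the training inputs, so only the held-out score says anything about shared structure.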
Huh, Cheung, Wang, and Isola (2024) formalized this as the Platonic Representation Hypothesis: as models get larger and are trained on more data, their representations converge toward a shared "Platonic" model of reality, a compressed statistical model of the world that is architecture-independent. The evidence is their cross-dataset, cross-architecture similarity metric, which shows that larger models are more similar to each other across architecture families than smaller models are. Convergence with scale is the key signature.
Universality also appears at the data level. Training datasets drawn from very different sources share deep statistical regularities: Zipf's law for token frequencies (frequency ∝ 1/rank), power-law spectral structure in natural images (Fourier amplitude ∝ 1/frequency), and heavy-tailed word co-occurrence statistics. These are not coincidences of data curation — they reflect fundamental properties of how information is structured in the physical and social world. Learning Mechanics asks: if the data universally has power-law structure, do the representations that emerge from learning that data also have universal structure? The evidence increasingly says yes.
Universality is deeply important for Learning Mechanics because it suggests that there exist theory-level abstractions that describe all well-trained networks, regardless of architecture details. Just as thermodynamic laws describe all gases regardless of what molecules compose them, Learning Mechanics could describe all well-trained networks regardless of whether they are MLPs, CNNs, or transformers. The architecture specifics would be the equivalent of molecular details — relevant for some questions, irrelevant for the macroscopic laws.
The flip side: universality also means that representational similarity across datasets is a diagnostic. If two models trained on different datasets have similar cross-dataset similarity scores, they've learned similar world models. If a new architecture scores low on cross-dataset similarity, it's learning something idiosyncratic — possibly a red flag. This gives practitioners a new tool for model evaluation that goes beyond test loss.
Four model families (ResNet, ViT, MLP-Mixer, ConvNeXt) trained independently. Adjust model size to watch their representations converge toward a shared structure.
Learning Mechanics and Mechanistic Interpretability (MI) are often discussed separately — LM as theory, MI as empirical reverse-engineering. But the paper argues they are in a symbiotic relationship analogous to physics and biology. Physics provides the laws (thermodynamics, electromagnetism) that constrain what biological systems can do, even though biologists rarely derive their findings from first principles. And biology gives physicists new phenomena to explain — cell membranes, action potentials, flocking behavior — that extend and stress-test physical theory. LM and MI stand in exactly this relation.
Learning Mechanics provides the formal scaffolding that MI needs to move from cataloguing findings to building mechanistic predictions. MI has discovered induction heads — attention heads that perform in-context learning via a "copy previous token when current token matches" circuit. But why do induction heads form? Under what conditions? At what scale? MI can measure their presence, but LM should explain their necessity. The LM perspective: induction heads are a low-complexity solution to a high-frequency pattern in natural language, and the Discretization Hypothesis predicts that SGD noise will discover and reinforce this solution preferentially. That's a testable prediction.
Conversely, MI gives LM phenomena to explain. MI has found that models trained on modular arithmetic learn to represent numbers as Fourier modes, use those modes to perform "clock arithmetic," and then grok — generalize suddenly after a long plateau of memorization. This is a concrete, measurable phenomenon with a clear structure. LM's job is to derive this from first principles: why Fourier modes? Why grokking (sudden generalization)? Why at that training step? The MI finding constrains and guides the theory in a way that scaling law experiments alone cannot.
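The "clock arithmetic" idea can be written out by hand. The sketch below is a hand-coded illustration of the algorithm, not weights extracted from a trained model: each number is represented by the cosine and sine of a few frequencies, the angle-addition identities combine a and b, and the answer is the residue c whose clock position best matches a + b. The function name and the default frequency are assumptions for the example.

```python
import numpy as np

def clock_add(a, b, p, freqs=(1,)):
    """Compute (a + b) mod p using only Fourier features of a, b, and candidates c.

    For each frequency k, with w = 2*pi*k/p:
      cos(w(a+b)) = cos(wa)cos(wb) - sin(wa)sin(wb)
      sin(w(a+b)) = sin(wa)cos(wb) + cos(wa)sin(wb)
    The score for candidate c is sum_k cos(w(a+b-c)), which is maximal
    exactly when c = (a + b) mod p (for prime p and k not a multiple of p).
    """
    c = np.arange(p)
    scores = np.zeros(p)
    for k in freqs:
        w = 2.0 * np.pi * k / p
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        scores += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(scores))

if __name__ == "__main__":
    p = 113
    print(clock_add(100, 50, p), (100 + 50) % p)
```

Using several frequencies, as in `clock_add(a, b, p, freqs=(3, 7, 11))`, sharpens the peak at the correct answer, since the cosines only add up constructively at that one residue.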
The LM ↔ MI relationship maps onto the three-level description hierarchy familiar from cognitive science: computational (what does the system do?), algorithmic (what procedure does it use?), and implementation (how is it physically instantiated?). MI operates primarily at the algorithmic and implementation levels — it finds circuits and identifies the computations they perform. LM operates at the computational and algorithmic levels — it asks what computations gradient descent is trying to perform and why. The two together provide a complete description.
The three levels of description in science, and how Learning Mechanics and Mechanistic Interpretability map onto them. Click each level to expand the analogy.
A manifesto for a new scientific theory would be incomplete without confronting the strongest objections. The authors take this seriously — they dedicate a full section to four objections to Learning Mechanics, each representing a genuine concern that has been raised by serious researchers. The objections are not strawmen, and the responses are not dismissals. Understanding them sharpens the theory's claims.
Objection 1: Deep learning is too complex for theory. The argument: with billions of parameters, millions of training steps, and proprietary data, the system is too high-dimensional and heterogeneous for any tractable theory. Response: thermodynamics faces the same complexity objection (10²³ molecules), yet macroscopic laws emerge. The argument proves too much — it would also rule out theoretical ecology, macroeconomics, and climate science. The key is finding the right coarse-grained variables. The five evidence lines suggest those variables exist for deep learning (sharpness, scaling exponents, feature similarity scores).
Objection 2: Empirical laws are curve-fitting, not understanding. The argument: neural scaling laws are power-law fits to data. Power laws appear everywhere in complex systems (earthquake sizes, city populations, word frequencies). Finding one in neural loss curves might be coincidence or data artifact, not a deep fact about learning. Response: the scaling laws are predictive over 5 orders of magnitude and across architectures. They predicted GPT-3's performance before it was trained. Ptolemy's epicycles fit planetary motion but couldn't predict new planets; Newton's laws could. Scaling laws are more Newton than Ptolemy, but the objection is valid — we need the mechanistic explanation of why β ≈ 0.07 for language.
Objection 3: Modern systems are too heterogeneous. Theory derived from simple settings (linear networks, infinite-width limits) may not extend to RLHF-finetuned LLMs with mixture-of-experts and custom attention masks. Response: this is a real concern. Learning Mechanics explicitly claims only partial coverage — the humble desideratum means acknowledging what the theory doesn't yet explain. The current theory covers pretraining dynamics well; finetuning, RLHF, and multi-modal training are active frontiers. This is progress, not failure.
Objection 4: A theory developed for today's architectures will be obsolete when architectures change. If state-space models replace transformers, or if some new training paradigm emerges, won't all the transformer-specific theory be wasted? Response: the theory's targets are architecture-independent where possible. Scaling laws, edge of stability, and µP hold across transformers and MLPs. Universality holds across architectures. The Discretization Hypothesis applies to any discrete-time optimizer. Architecture-specific findings (like transformer-specific phase transitions) are explicitly scoped as such.
Click each objection card to see the Learning Mechanics response.
Learning Mechanics is not one paper — it is a convergent research program across many groups. The five evidence lines (solvable settings, simplifying limits, empirical laws, hyperparameter transfer, universality) each represent years of independent work that the paper synthesizes into a coherent framework. Understanding the program means knowing the key papers and how they connect, and knowing what remains open.
The paper also identifies ten open directions for Learning Mechanics, each representing a major gap between current theory and practice. The key papers discussed above, grouped by evidence line, are:
| Paper | Year | Contribution | Evidence Line |
|---|---|---|---|
| Saxe, McClelland, Ganguli | 2014 | Deep linear network exact dynamics: sequential SV learning | Solvable settings |
| Jacot, Gabriel, Hongler | 2018 | Neural Tangent Kernel: infinite-width exact solution | Solvable settings |
| Kaplan et al. | 2020 | Neural scaling laws: L = αN^{-β} + L∞ over 5 orders of magnitude | Empirical laws |
| Cohen et al. | 2021 | Edge of Stability: sharpness tracks 2/η during training | Empirical laws |
| Yang et al. (µP) | 2022 | Maximal Update Parameterization: width-independent hyperparameter transfer | Hyperparameter transfer |
| Huh, Cheung, Wang, Isola | 2024 | Platonic Representation Hypothesis: cross-architecture convergence | Universality |
The five evidence lines and how they connect to each other and to practice. Lines show theoretical dependencies.