Learning Mechanics: A Scientific Theory of Deep Learning

Chapter 0: The Problem

Deep learning works. This is not in dispute. GPT-4 writes code, AlphaFold folds proteins, diffusion models generate photorealistic images. The collective engineering achievement is staggering. But here is the strange part: we built all of this before we understood why it works. We are, in the words of this paper, like the 18th-century engineers who built efficient steam engines before thermodynamics existed to explain them.

James Watt improved the steam engine empirically: try this, measure that, iterate. His engines worked. But he could not tell you why a hotter steam source produced more work per unit fuel, or what the fundamental limit on efficiency was. That understanding came roughly half a century later, when Carnot (1824) showed that the maximum efficiency of a heat engine depends only on the temperatures of its hot and cold reservoirs; the entropy formulation followed later still, with Clausius. And once the theory existed, it didn't just explain what we already knew. It predicted new things, guided new designs, and became the backbone of all subsequent thermodynamic engineering.

The authors argue deep learning is at the Watt moment. We have the engines. We are missing the Carnot. Practitioners rely on lore — "use Adam, not SGD," "batch norm helps," "deeper is better with residuals" — but these rules are folklore without derivation. They work until they don't, and we can't always tell in advance which situation we're in. The paper's thesis: a genuine scientific theory of deep learning — what they call Learning Mechanics — is now within reach.

What makes deep learning measurable (and thus theorizable)

Theory requires measurement. The good news is that deep learning is uniquely well-instrumented. Every training run is a controlled experiment: fixed architecture, fixed dataset, fixed optimizer, deterministic (up to seeds) dynamics. Unlike biological neural circuits or social systems, we can replay the experiment exactly. We can intervene — change one hyperparameter, rerun — and measure the causal effect. This makes deep learning amenable to the scientific method in a way that many complex systems are not.

The bad news is scale. A modern LLM can have 70 billion parameters or more, so its loss landscape lives in a space with tens of billions of dimensions. Its training trajectory is a path through that space over hundreds of thousands or millions of steps. Describing this system fully is hopeless. But we don't need a full description; we need the right description. Thermodynamics didn't track every molecule; it found the macroscopic variables (temperature, pressure, entropy) that captured the relevant physics. Learning Mechanics needs to find the analogous macroscopic variables for neural networks.

The steam engine analogy: Watt's engine worked before Carnot's theory. GPT-4 works before Learning Mechanics. But Carnot's theory enabled the internal combustion engine, the refrigerator, and the Rankine cycle. Theory doesn't just explain — it enables. A scientific theory of deep learning would let us design architectures the way we design turbines: from first principles, with predictable performance guarantees.

Figure: Steam Engine → Thermodynamics → Deep Learning timeline. Each era in the history of thermodynamics is paired with its parallel in the emerging theory of deep learning.

Why is deep learning particularly amenable to a scientific theory, unlike, say, economics?

(a) Training runs are fully controlled experiments (fixed architecture, dataset, and optimizer) and can be replayed exactly, enabling true causal measurement
(b) Deep learning models are smaller and simpler than economic systems
(c) We already understand the math of gradient descent completely

Chapter 1: Seven Desiderata

Before building a theory, we need to agree on what a theory should do. The authors propose seven desiderata — criteria that Learning Mechanics must satisfy to be genuinely useful. These are not arbitrary; they're distilled from the history of successful scientific theories in physics, and each one addresses a specific failure mode of existing attempts to theorize deep learning.

The first three are epistemic: the theory must be fundamental (deriving behavior from basic principles, not just renaming observations), mathematical (with equations that make quantitative predictions, not just qualitative stories), and predictive (making correct statements about systems not yet observed, not just fitting known data). These are the core requirements of any scientific theory. Existing DL theory often fails here — "the model learns features" is fundamental-sounding but unmathematical; "generalization gap is bounded by complexity" is mathematical but rarely predictive in practice.

The next three are about scope and style: comprehensive (covering the whole training pipeline — architecture, optimizer, data, generalization — not just one slice), intuitive (building mechanistic understanding, not just curve-fitting), and useful (guiding real practitioner decisions, like which architecture to use or how to set the learning rate). The final desideratum is perhaps the most important and most often violated: humble. The theory should know what it doesn't cover, explicitly declaring its assumptions and domain of validity. Overreaching theories — claiming to explain all of deep learning when they only apply to two-layer linear nets — are worse than no theory at all, because they mislead.

How this differs from PAC learning

PAC (Probably Approximately Correct) learning is the classical framework of statistical learning theory. It gives bounds of the form: "with high probability, a model trained on N examples will have test error close to its training error." These bounds are mathematically rigorous and general. But they spectacularly fail the predictive desideratum for modern deep learning. A ResNet-50 has about 25 million parameters and is trained on about 1.2 million ImageNet examples. Because the model has far more parameters than training examples, classical complexity-based bounds are vacuous: they cannot rule out arbitrarily poor generalization. Yet it generalizes beautifully. PAC theory describes a worst-case universe; real deep learning lives in a best-case slice of that universe that the theory doesn't characterize.
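To put rough numbers on that mismatch, here is a back-of-the-envelope sketch. It assumes a generic bound whose gap term scales like sqrt(d / N), with the raw parameter count d standing in for the complexity measure. Real PAC and VC bounds carry additional constants and log factors, so this is illustrative only, and `complexity_gap` is a hypothetical helper name.

```python
import math

def complexity_gap(num_params: int, num_examples: int) -> float:
    """Rough generalization-gap term sqrt(d / N) from a classical-style bound.

    Assumption: complexity is proxied by the raw parameter count d, and
    constants and log factors are dropped.
    """
    return math.sqrt(num_params / num_examples)

# ResNet-50 on ImageNet, using the figures quoted in the text.
gap = complexity_gap(25_000_000, 1_200_000)
# Any value >= 1 says nothing about a 0-1 error rate, which is at most 1 anyway.
print(f"bound on the error gap ~ {gap:.2f}")
```

Under these assumptions the bound only becomes informative once N is much larger than d, which is the opposite of the regime modern networks are trained in.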

Learning Mechanics aims to be a mechanistic theory of that best-case slice — explaining why SGD with weight decay finds generalizing solutions, not just proving that such solutions exist in principle. The distinction mirrors physics vs. mathematics: a mathematician can prove solutions to the Navier-Stokes equations exist; a physicist wants to compute what the turbulence actually looks like.

Physics ↔ DL parallels: A ball in a potential well feels the force F = -∇V. A neural network trained by gradient descent follows θ ← θ - η∇L, which is a discrete-time version of the overdamped motion dθ/dt = -∇L. The mathematics is closely analogous. Temperature adds random kicks to the ball's trajectory (Langevin dynamics); minibatch sampling adds random kicks to the network's trajectory (SGD noise), with a strength that grows with the learning rate. Equilibrium in the well plays the role of the trained, generalizing network. Learning Mechanics exploits these parallels systematically.

Figure: Physics ↔ Deep Learning conceptual parallels. Each row pairs a physical concept with its Learning Mechanics analog and the mathematical connection between them.

What distinguishes Learning Mechanics from PAC learning theory?

(a) PAC gives worst-case generalization bounds that fail for overparameterized networks; LM aims for mechanistic predictions of why SGD finds good solutions in practice
(b) Learning Mechanics is more mathematically rigorous than PAC
(c) PAC theory doesn't apply to neural networks at all

Chapter 2: Solvable Settings

A theory is only as good as its solvable cases. Classical mechanics began with simple pendulums and point masses — idealized settings where the math closes. From those, it built intuition for the general case. Learning Mechanics follows the same strategy: find simplified models of neural networks where training dynamics can be computed exactly, then use those exact solutions to build intuition about realistic networks.

The most productive solvable setting is the deep linear network — a neural network with multiple layers but no nonlinearities. It sounds trivially simple (it's just a matrix product), but its training dynamics under gradient flow are surprisingly rich. Saxe, McClelland, and Ganguli (2014) showed that gradient flow on a deep linear network performs an implicit form of greedy low-rank matrix factorization: the singular values of the weight product matrix are learned one at a time, largest first. Specifically, if the target matrix has singular values σ₁ ≥ σ₂ ≥ ... ≥ σₙ, then at time t during training, the learned matrix's singular values follow a sigmoid-like trajectory, with σ₁ switching on first, then σ₂, and so on.

Why does this matter? Because it shows that even without nonlinearities, gradient descent has nontrivial inductive biases: it doesn't learn all components of the target simultaneously. It prioritizes structure. The time at which each singular value is learned depends on both its magnitude and on depth: depth changes the timescale and sharpens each transition (from a small initialization, deeper networks sit on longer plateaus and then jump more abruptly), but the sequential order is preserved. This is a clean, mathematical statement about learning dynamics that can then be tested against nonlinear networks.
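The sequential switching can be reproduced in the smallest possible case. The sketch below assumes a two-layer scalar "network" a·b fitted to a single target singular value σ, started from a small balanced initialization a = b = a0, and integrates gradient flow with plain Euler steps. The function names, the step size, and the three σ values are illustrative choices, not taken from Saxe et al.

```python
import math

def time_to_half(sigma, a0=1e-3, dt=1e-3, max_steps=1_000_000):
    """Euler-integrate gradient flow on 0.5 * (sigma - a*b)**2 from a = b = a0.

    Returns the time at which the product a*b first reaches sigma / 2.
    """
    a = b = a0
    for step in range(max_steps):
        if a * b >= 0.5 * sigma:
            return step * dt
        err = sigma - a * b
        a, b = a + dt * b * err, b + dt * a * err
    return math.inf

def logistic_time_to_half(sigma, a0=1e-3):
    """Closed form for the same time, from du/dt = 2*u*(sigma - u) with u = a*b."""
    u0 = a0 * a0
    return math.log(sigma / u0 - 1.0) / (2.0 * sigma)

sigmas = [3.0, 1.0, 0.3]
times = [time_to_half(s) for s in sigmas]
for s, t in zip(sigmas, times):
    print(f"sigma={s}: switches on at t ~ {t:.2f} (closed form {logistic_time_to_half(s):.2f})")
```

Because each mode follows a logistic curve whose rate is proportional to σ, the larger singular values cross their halfway point first, matching the sigmoid-like trajectories described above.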

The Neural Tangent Kernel linearization

The second key solvable setting is the Neural Tangent Kernel (NTK) regime, introduced by Jacot, Gabriel, and Hongler (2018). The idea: a neural network f(x; θ) can be linearized around its initial parameters θ₀. The linear approximation is f_lin(x; θ) = f(x; θ₀) + ∇_θf(x; θ₀)ᵀ(θ - θ₀). In this linear model, gradient descent becomes exactly solvable — the dynamics are a linear ODE, and the solution can be written in closed form using the kernel K(x, x') = ∇_θf(x; θ₀)ᵀ∇_θf(x'; θ₀), the NTK.
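As a concrete sketch of these two formulas, the code below builds a tiny one-hidden-layer tanh network on scalar inputs, computes its parameter gradient by hand, and evaluates both the kernel K(x, x') and the linearized model f_lin. The architecture, the 1/√m output scaling, and all function names are assumptions made for illustration.

```python
import math
import random

def init_params(m, seed=0):
    """Random initial weights theta_0 = (w, a) for a width-m network."""
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(m)]
    a = [rng.gauss(0, 1) for _ in range(m)]
    return w, a

def f(x, w, a):
    """f(x; theta) = (1/sqrt(m)) * sum_i a_i * tanh(w_i * x)."""
    return sum(ai * math.tanh(wi * x) for wi, ai in zip(w, a)) / math.sqrt(len(w))

def grad(x, w, a):
    """Gradient of f with respect to all parameters, ordered as (w, a)."""
    s = 1.0 / math.sqrt(len(w))
    gw = [s * ai * x * (1.0 - math.tanh(wi * x) ** 2) for wi, ai in zip(w, a)]
    ga = [s * math.tanh(wi * x) for wi in w]
    return gw + ga

def ntk(x1, x2, w, a):
    """K(x1, x2) = grad f(x1)^T grad f(x2), evaluated at the given parameters."""
    return sum(g1 * g2 for g1, g2 in zip(grad(x1, w, a), grad(x2, w, a)))

def f_lin(x, w, a, w0, a0):
    """First-order Taylor expansion of f around theta_0 = (w0, a0)."""
    delta = [wi - w0i for wi, w0i in zip(w, w0)] + [ai - a0i for ai, a0i in zip(a, a0)]
    return f(x, w0, a0) + sum(g * d for g, d in zip(grad(x, w0, a0), delta))

w0, a0 = init_params(25)
print("K(0.3, -0.7) =", ntk(0.3, -0.7, w0, a0))
```

For parameters close to θ₀ the linearized model tracks the true network up to second-order terms, which is the property the lazy regime relies on.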

In very wide networks (infinite width limit), the NTK doesn't change during training — the weights move so little relative to the scale of the random initialization that the Jacobian ∇_θf stays approximately constant. This is the lazy training regime. In this regime, the network behaves like a kernel method, and its generalization can be analyzed using kernel theory. The NTK gave a mathematically complete theory of infinitely wide networks — predictive, rigorous, and exact. The problem: real networks are not infinitely wide, and in the NTK regime they don't learn useful features. The NTK is a solvable idealization, not a description of practice.

Why largest singular values first? Gradient flow on a linear network with small initialization can be viewed as a gradient flow on a manifold of low-rank matrices equipped with a Bures-type metric. In the two-layer case, the effective "attraction" toward the target's i-th singular component scales as σᵢ times the current component magnitude. Large σᵢ creates stronger pull, causing those components to escape zero first. It's a self-reinforcing process: small random fluctuations at initialization break symmetry, the largest-σ component grows first, and the exponential dynamics make it dominate before smaller components have time to grow. Depth amplifies this cascade.

Figure: Deep linear net, sequential singular value learning. The singular values of the learned weight product switch on one by one, largest first; a depth control (shown at depth 2) changes how quickly and how sharply each one converges.

In a deep linear network trained by gradient flow, why are the largest singular values of the target matrix learned first?

(a) The gradient signal for each singular component scales with its magnitude: larger singular values create stronger self-reinforcing growth dynamics that cause them to escape the zero fixed point first
(b) The optimizer explicitly prioritizes high-energy directions
(c) Larger singular values correspond to more data, so the gradient is larger

Chapter 3: Simplifying Limits

When the exact analysis of deep linear networks or NTK becomes intractable for realistic architectures, Learning Mechanics turns to simplifying limits — extreme parameter regimes where the theory simplifies dramatically. The most important such limit is the lazy vs. rich dichotomy, controlled by a single parameter α that governs the scale of network outputs at initialization.

Write the network as f(x; θ) = α · g(x; θ), where g is a standard parameterized network and α scales the output. When α is large, a tiny change in the parameters produces a large change in the network function. Here's the counterintuitive part: the network can then fit the training data while its weights barely move from initialization, so the linearization around θ₀ stays accurate throughout training. Mathematically, the network enters the lazy training (NTK) regime. The NTK stays approximately constant, features don't adapt, and the model behaves like a kernel method. When α is small, the opposite happens: the weights must travel far from initialization to fit the data, and the network is in the rich or mean-field regime, where the NTK evolves during training, features adapt, and the network learns genuinely new representations rather than just re-weighting fixed features.

This dichotomy matters enormously in practice because the two regimes generalize differently. Lazy networks generalize like kernel methods — well when the target function is in the RKHS of the NTK, poorly otherwise. Rich networks can learn features aligned with the data structure, potentially achieving much better generalization on low-dimensional tasks. The transition between lazy and rich is sharp (like a phase transition), and the location of the transition depends on α, width, and learning rate in predictable ways that the theory can compute.
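A toy version of the α dial can be run in a few lines. The sketch assumes a two-parameter model g = w1·w2 − 1 (centered so the output is zero at initialization), a single scalar target, and a loss divided by α², a scaling borrowed from common analyses of lazy training so that the fit proceeds at a comparable rate for every α. It reports how much the tangent kernel of g, here w1² + w2², changes over training. All names and constants are illustrative.

```python
def train_scaled_model(alpha, target=1.0, lr=0.1, steps=500):
    """Gradient descent on f = alpha * (w1*w2 - 1), starting from w1 = w2 = 1.

    Assumptions: output centered at initialization, loss (f - target)^2 / (2 * alpha^2).
    Returns the final output and the relative change of the tangent kernel.
    """
    w1 = w2 = 1.0
    kernel_init = w1 * w1 + w2 * w2          # squared gradient norm of g at init
    for _ in range(steps):
        f = alpha * (w1 * w2 - 1.0)
        scale = (f - target) / alpha         # d(loss)/dw1 = scale * w2, likewise for w2
        w1, w2 = w1 - lr * scale * w2, w2 - lr * scale * w1
    f = alpha * (w1 * w2 - 1.0)
    kernel_final = w1 * w1 + w2 * w2
    return f, abs(kernel_final - kernel_init) / kernel_init

for alpha in (1.0, 100.0):
    output, kernel_change = train_scaled_model(alpha)
    print(f"alpha={alpha}: output={output:.4f}, relative kernel change={kernel_change:.3f}")
```

Both settings fit the target, but the kernel roughly doubles at α = 1 and changes by about one percent at α = 100. That is the lazy versus rich distinction in miniature: same fit, very different amounts of feature movement.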

The Discretization Hypothesis

One of the more striking claims in the paper is the Discretization Hypothesis: that SGD (stochastic gradient descent, which updates on minibatches) is not just a noisy approximation of full-batch gradient descent, but qualitatively different in a way that promotes generalization. The hypothesis is that SGD noise acts like a symmetry-breaking field that biases the optimizer toward solutions with discrete structure — weight matrices with low effective rank, representations that cluster into discrete prototypes (neural collapse), circuits that implement clean Boolean functions.

The evidence is circumstantial but striking: models trained with larger batch sizes (less SGD noise) often generalize worse, even when the learning rate is adjusted to compensate. The discrete structures that emerge in trained networks (such as the Fourier-basis modular arithmetic circuits found in mechanistic interpretability) look like they arose from a discretizing pressure, not from gradient descent alone. The hypothesis provides a unified story for why neural collapse, grokking, and low-rank structure all emerge from the same training procedure.
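The "larger batch, less noise" part of this argument is easy to check numerically. The sketch below assumes a one-parameter loss 0.5·(θ − y)² per example and measures the variance of the minibatch gradient at θ = 0 for two batch sizes. It illustrates only the noise scaling, not the generalization claim, and the dataset is synthetic.

```python
import random
import statistics

def minibatch_gradient_variance(batch_size, data, trials=2000, seed=0):
    """Variance of the minibatch gradient of 0.5*(theta - y)^2 at theta = 0.

    The per-example gradient at theta = 0 is -y, so the minibatch gradient is
    minus the batch mean. Batches are sampled with replacement.
    """
    rng = random.Random(seed)
    grads = []
    for _ in range(trials):
        batch = rng.choices(data, k=batch_size)
        grads.append(-sum(batch) / batch_size)
    return statistics.pvariance(grads)

data_rng = random.Random(123)
data = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]   # synthetic targets
var_small = minibatch_gradient_variance(4, data)
var_large = minibatch_gradient_variance(64, data)
print(f"variance ratio (batch 4 vs batch 64): {var_small / var_large:.1f}")
```

With sampling with replacement the variance scales as 1/B, so going from batch size 4 to 64 should reduce it by a factor of about 16.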

Lazy vs Rich in one sentence: Large α → NTK regime (lazy) → features frozen → kernel method. Small α → mean-field regime (rich) → features learn → representation learning. The parameter α is the "phase dial" between two fundamentally different learning behaviors, and most practical networks sit somewhere in between — closer to rich for transformers, closer to lazy for very large MLPs.

Figure: Lazy vs rich, function fitting under α. Two networks fit a 1D target function; an output-scale control α (shown at α = 10) moves between the lazy (NTK, fixed features) and rich (adaptive features) regimes and changes the learned function.

What is the Discretization Hypothesis?

(a) SGD noise is not just approximation error but a qualitative force that biases networks toward solutions with discrete structure: low-rank weights, clustered representations, and clean circuits
(b) Discrete activation functions generalize better than continuous ones
(c) Neural networks discretize continuous input spaces into learned categories

Chapter 4: Empirical Laws

The strongest evidence that a scientific theory of deep learning is forming is the discovery of quantitative empirical laws — equations with fitted constants that hold across orders of magnitude of variation. Thermodynamics didn't begin with derivations from statistical mechanics; it began with Boyle's Law (PV = constant at fixed T) and Charles's Law (V ∝ T at fixed P). These were measured regularities before they were explained. Learning Mechanics has found its own version of these laws, and they are surprisingly tight.

The most famous is the neural scaling law, first measured systematically by Kaplan et al. (2020). Train autoregressive language models of varying sizes on varying amounts of data, and measure the test loss. The result: L(N) ≈ α · N^{-β} + L∞, where N is the number of model parameters, α and β are fitted constants (β ≈ 0.07 for language; this α is a fitted prefactor, not the output scale of Chapter 3), and L∞ is the irreducible entropy of the data. This power law holds over five orders of magnitude in N, from 10⁵ to 10¹⁰ parameters. Analogous power laws hold when loss is plotted against compute or dataset size instead of parameters. Similar laws, with different fitted constants, have been measured across architectures and datasets. The law is so consistent that practitioners now use it for extrapolation: fit the law on small models, predict the loss at GPT-4 scale.
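The fit-then-extrapolate workflow can be sketched as follows. The "measurements" are synthetic, generated from made-up constants (prefactor 10, exponent 0.07, irreducible loss 1.7), and the irreducible loss is assumed known so that the fit reduces to a straight line in log-log space. The names `fit_power_law` and `predict_loss` are hypothetical.

```python
import math

def fit_power_law(sizes, losses, irreducible):
    """Least-squares fit of log(L - L_inf) = log(a) - b * log(N); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss - irreducible) for loss in losses]
    count = len(xs)
    x_mean, y_mean = sum(xs) / count, sum(ys) / count
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope

def predict_loss(n, a, b, irreducible):
    """Evaluate L(N) = a * N^(-b) + L_inf."""
    return a * n ** (-b) + irreducible

# Synthetic small-model measurements from assumed constants (not real data).
TRUE_A, TRUE_B, L_INF = 10.0, 0.07, 1.7
small_sizes = [1e5, 1e6, 1e7, 1e8]
small_losses = [predict_loss(n, TRUE_A, TRUE_B, L_INF) for n in small_sizes]

a_fit, b_fit = fit_power_law(small_sizes, small_losses, L_INF)
print(f"fitted exponent = {b_fit:.3f}")
print(f"extrapolated loss at N = 1e10: {predict_loss(1e10, a_fit, b_fit, L_INF):.3f}")
```

In practice the irreducible term is itself unknown and has to be fitted jointly, which makes real extrapolations considerably more delicate than this noiseless example.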

The second major empirical law is the Edge of Stability phenomenon, discovered by Cohen et al. (2021). Train a network with full-batch gradient descent at learning rate η. The sharpness of the loss landscape (largest eigenvalue of the Hessian, λ_max) does not converge to a small value. Instead, it climbs during training until it hovers at or just above 2/η. This is striking because 2/η is the exact threshold at which gradient descent becomes unstable for a quadratic loss (unstable when λ_max · η > 2). The network sits perpetually at this edge, neither diverging nor reducing sharpness further. This is a quantitative prediction: double the learning rate, and the equilibrium sharpness halves.
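The 2/η threshold for a quadratic loss can be verified directly. This minimal sketch runs gradient descent on 0.5·λ·x² with the sharpness λ set just below and just above 2/η. It demonstrates only the stability threshold, not the progressive sharpening seen in real networks.

```python
def gd_on_quadratic(sharpness, lr, steps=200, x0=1.0):
    """Run gradient descent on 0.5 * sharpness * x**2 and return |x| at the end."""
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x      # x <- (1 - lr * sharpness) * x
    return abs(x)

def stability_threshold(lr):
    """Largest sharpness for which the iteration above does not diverge."""
    return 2.0 / lr

lr = 0.1
below = gd_on_quadratic(0.95 * stability_threshold(lr), lr)
above = gd_on_quadratic(1.05 * stability_threshold(lr), lr)
print(f"threshold = {stability_threshold(lr):.1f}")
print(f"5% below threshold: |x| = {below:.2e}")
print(f"5% above threshold: |x| = {above:.2e}")
```

Each step multiplies x by (1 − ηλ), so the iterate shrinks when ηλ < 2 and grows when ηλ > 2. The threshold 2/η is inversely proportional to the learning rate.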

Neural collapse: geometry at the end of training

A third law, neural collapse, discovered by Papyan, Han, and Donoho (2020), describes the geometry of representations in the final layer of a classifier at the end of training. As training continues past the point of zero training error (terminal phase), the within-class variability of last-layer representations collapses to zero, the class means form a maximally spread simplex equiangular tight frame (ETF), and the classifier weights align with the class means. This is a precise geometric prediction about a quantity that, naively, could be anything. It holds across datasets, architectures, and loss functions. Neural collapse is now used as a diagnostic for training quality and as a design principle for loss functions.
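The simplex ETF geometry is easy to construct explicitly. The sketch below uses one standard construction (center the standard basis vectors, then rescale to unit length) and checks the defining property: every pair of distinct class means has the same cosine, −1/(K − 1). Function names are illustrative.

```python
import math

def simplex_etf(num_classes):
    """Rows are K unit vectors in R^K forming a simplex equiangular tight frame.

    Construction: subtract 1/K from every entry of the identity matrix,
    then rescale each row to unit length.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

means = simplex_etf(4)
print([round(cosine(means[0], means[j]), 4) for j in range(1, 4)])
```

A value of −1/(K − 1) is the most negative cosine that K vectors can share pairwise, which is the sense in which the class means are "maximally spread".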

The power of quantitative laws: "Bigger models do better" is an observation. "L(N) = α·N^{-β} + L∞ with β ≈ 0.07 and fit over 5 orders of magnitude" is a law. The difference is predictive precision. Fits made on small models have been extrapolated to much larger ones before those were trained: GPT-3 (175B params) landed on the power-law trend established by its smaller siblings, and the GPT-4 technical report describes forecasting the final loss from runs that used a small fraction of the compute. That's what makes scaling laws scientifically meaningful: falsifiable quantitative prediction, not post-hoc description.

Figure: Scaling laws and Edge of Stability. Left: the neural scaling law on log-log axes with its power-law fit. Right: edge of stability, with a learning-rate control (shown at η = 0.10) illustrating how equilibrium sharpness tracks 2/η.

If you double the learning rate η in gradient descent, what does the Edge of Stability law predict happens to the equilibrium sharpness λ_max?

(a) Equilibrium sharpness halves: since λ_max stabilizes near 2/η, doubling η halves 2/η, and halving η doubles it
(b) Sharpness doubles because larger steps push the network into sharper regions
(c) Sharpness stays constant because it depends only on architecture

Chapter 5: Hyperparameters

One of the most practically consequential areas where Learning Mechanics is making progress is hyperparameter transfer — the problem of setting learning rates and other optimizer parameters for large models. The current practice is brutal: tune on a small proxy model, hope the optimal hyperparameters transfer to the large model, and burn enormous compute when they don't. The Maximal Update Parameterization (µP), developed by Greg Yang and colleagues (2022), provides a theoretical framework that makes transfer provably correct under certain assumptions.

The core insight of µP is that standard parameterizations scale hyperparameters badly with width. In a standard MLP with hidden width W, the optimal learning rate shrinks roughly as 1/W: as you widen the network, you must shrink the learning rate proportionally. This means hyperparameters tuned at width 128 are completely wrong at width 4096. µP fixes this by reparameterizing the network so that the optimal learning rate is width-independent. Concretely, where the standard parameterization scales the output (readout) weights as 1/√W, µP scales them as 1/W and adjusts the per-layer learning rates to compensate. The result: the effective update magnitude to each neuron stays constant as width grows, and the optimal learning rate η* stays approximately constant across widths.

The practical implication is striking. The linear scaling rule for learning rate (popular in distributed training) says: if you increase batch size by k, multiply learning rate by k. This works empirically for moderate batch sizes. µP provides the theoretical explanation: both the linear scaling rule and µP's width transfer rule come from the same underlying principle — keeping the ratio of gradient signal to parameter scale constant. µP extends this principle to architecture changes, not just batch size changes.

Implicit curvature regularization

Beyond µP, Learning Mechanics has identified a subtler hyperparameter effect: implicit curvature regularization. When you train with gradient descent at finite step size η (as opposed to infinitesimal gradient flow), the optimizer implicitly minimizes not just the loss L(θ) but, to leading order in η, the modified loss L(θ) + (η/4) · ||∇L(θ)||². The extra term penalizes gradient norm, so it acts as a regularizer that biases the solution toward flat minima. This is not a designed regularizer; it emerges from the discretization of continuous gradient flow by finite step size. The effect is: larger η → stronger flatness regularization → better generalization (up to a point). This is one proposed explanation for empirically observed phenomena such as learning rate schedules (including "warm-up") affecting generalization even when the final learning rate is the same.
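The modified-loss claim can be checked by hand on a quadratic, where everything has a closed form. The sketch below assumes L(θ) = 0.5·c·θ² and compares one gradient descent step of size η with two continuous flows run for time η: plain gradient flow on L, and gradient flow on L + (η/4)·|L'|². The function name and the chosen constants are illustrative.

```python
import math

def compare_one_step(curvature, lr, theta0=1.0):
    """Gap between one GD step on L = 0.5*c*theta^2 and two flows run for time lr.

    Closed forms for this quadratic:
      plain flow on L:              theta0 * exp(-lr*c)
      flow on L + (lr/4)*|L'|^2:    theta0 * exp(-lr*c*(1 + lr*c/2))
    Returns (gap to plain flow, gap to modified flow).
    """
    gd_step = theta0 * (1.0 - lr * curvature)
    plain_flow = theta0 * math.exp(-lr * curvature)
    modified_flow = theta0 * math.exp(-lr * curvature * (1.0 + 0.5 * lr * curvature))
    return abs(gd_step - plain_flow), abs(gd_step - modified_flow)

err_plain, err_modified = compare_one_step(curvature=1.0, lr=0.1)
print(f"gap to plain flow:    {err_plain:.2e}")
print(f"gap to modified flow: {err_modified:.2e}")
```

The discrete step sits more than ten times closer to the flow on the modified loss, which is what it means to say that finite-step gradient descent implicitly follows L plus a gradient-norm penalty. The agreement is one order higher in η, not exact.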

µP transfer in practice: Yang et al. (2022) tuned hyperparameters on small proxy transformers and transferred them directly to much larger models. As an illustration with schematic numbers: suppose the optimal learning rate under µP for a 125M-parameter transformer is η* ≈ 0.01. Under standard parameterization, the optimum might be 0.003 at 125M and shift to 0.0003 at 7B. Under µP, the optimum at 7B is still ≈ 0.01. Tune once at 125M, apply directly to 7B. This replaces the large-scale training runs that would otherwise be needed for hyperparameter search.

Figure: µP vs standard parameterization, optimal learning rate vs width. The blue line (standard parameterization) shows how the optimal LR drops as width grows. The orange line (µP) stays flat: tune once, transfer anywhere.

You tune hyperparameters on a 125M-parameter transformer under µP and find optimal η* = 0.01. You then train the same architecture at 7B parameters, also under µP. What learning rate should you use?

(a) η* ≈ 0.01: µP ensures the optimal learning rate is width-independent, so the tuned value transfers directly to larger models
(b) Scale down proportionally: η* ≈ 0.01 × (125M/7B) ≈ 0.00018
(c) You must re-tune, because optimal LR always depends on model size

Chapter 6: Universality

One of the most striking empirical discoveries supporting Learning Mechanics is universality: different neural networks trained independently on different datasets converge to strikingly similar representations. Train a ResNet on ImageNet and a ViT on LAION-5B. Take any image, compute its representation in both networks. The representations are not identical, but they are linearly related — there exists a matrix M such that, for most images, ResNet_repr ≈ M · ViT_repr. This linear alignment is far too strong to be coincidence. It suggests that these architectures, despite different sizes and training procedures, are discovering the same underlying structure in the visual world.

Huh, Cheung, Wang, and Isola (2024) formalized this as the Platonic Representation Hypothesis: as models get larger and are trained on more data, their representations converge toward a shared "Platonic" model of reality, a compressed statistical model of the world that is architecture-independent. The evidence is their cross-dataset, cross-architecture similarity metric, which shows that larger models are more similar to each other across architecture families than smaller models are. Convergence with scale is the key signature.
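To make "representational similarity" concrete, here is a minimal sketch of one simple measure: the fraction of k-nearest neighbours that two representations of the same inputs have in common. It is an assumed stand-in in the spirit of nearest-neighbour alignment metrics, not necessarily the exact metric used by the authors, and the three "models" are synthetic 2D point clouds.

```python
import random

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (squared Euclidean distance)."""
    def dist(j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    others = [j for j in range(len(points)) if j != i]
    return set(sorted(others, key=dist)[:k])

def mutual_knn_alignment(reps_a, reps_b, k=5):
    """Average fraction of k-nearest neighbours shared between two representations."""
    n = len(reps_a)
    overlap = sum(
        len(knn_indices(reps_a, i, k) & knn_indices(reps_b, i, k)) for i in range(n)
    )
    return overlap / (n * k)

rng = random.Random(0)
model_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
# "Model B" is an exact linear map of model A: swap the axes and rescale by 2.
model_b = [(2.0 * y, 2.0 * x) for x, y in model_a]
# "Model C" holds unrelated random features for the same 40 inputs.
model_c = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]

print("A vs B:", mutual_knn_alignment(model_a, model_b))
print("A vs C:", mutual_knn_alignment(model_a, model_c))
```

The map used for model B preserves distance rankings, so its neighbourhoods match model A's exactly; a general linear map would not guarantee that. Unrelated features share neighbours only at roughly chance level.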

Universality also appears at the data level. Training datasets drawn from very different sources share deep statistical regularities: Zipf's law for token frequencies (frequency ∝ 1/rank), power-law spectral structure in natural images (Fourier amplitude ∝ 1/frequency), and heavy-tailed word co-occurrence statistics. These are not coincidences of data curation — they reflect fundamental properties of how information is structured in the physical and social world. Learning Mechanics asks: if the data universally has power-law structure, do the representations that emerge from learning that data also have universal structure? The evidence increasingly says yes.

What universality implies for theory

Universality is deeply important for Learning Mechanics because it suggests that there exist theory-level abstractions that describe all well-trained networks, regardless of architecture details. Just as thermodynamic laws describe all gases regardless of what molecules compose them, Learning Mechanics could describe all well-trained networks regardless of whether they are MLPs, CNNs, or transformers. The architecture specifics would be the equivalent of molecular details — relevant for some questions, irrelevant for the macroscopic laws.

The flip side: representational similarity across datasets can serve as a diagnostic. If two models trained on different datasets show high representational similarity, they've learned similar world models. If a new architecture scores low on similarity to established models, it's learning something idiosyncratic, which is possibly a red flag. This gives practitioners a new tool for model evaluation that goes beyond test loss.

Convergence with scale: Small models (100M params) trained on ImageNet and Places365 have low cross-dataset representation similarity — they've specialized to their training distribution. Large models (1B+) trained on diverse data have high cross-dataset similarity — they've converged toward a shared world model. The Platonic Representation Hypothesis says this convergence continues: in the limit of infinite data and parameters, all models would represent the same "reality." Current models are early approximations to this limit.

Figure: Representational convergence across architectures. Four model families (ResNet, ViT, MLP-Mixer, ConvNeXt) are trained independently; a model-size control (shown at Small) illustrates how their representations converge toward a shared structure as size grows.

What does high cross-dataset representational similarity between two models imply, according to the Platonic Representation Hypothesis?

(a) Both models have converged toward a shared model of the underlying structure of reality, independent of which specific dataset they were trained on
(b) The two models were trained on the same data, causing similar representations
(c) Similarity implies the models overfit to the same spurious features

Chapter 7: LM ⇄ Mech Interp

Learning Mechanics and Mechanistic Interpretability (MI) are often discussed separately — LM as theory, MI as empirical reverse-engineering. But the paper argues they are in a symbiotic relationship analogous to physics and biology. Physics provides the laws (thermodynamics, electromagnetism) that constrain what biological systems can do, even though biologists rarely derive their findings from first principles. And biology gives physicists new phenomena to explain — cell membranes, action potentials, flocking behavior — that extend and stress-test physical theory. LM and MI stand in exactly this relation.

Learning Mechanics provides the formal scaffolding that MI needs to move from cataloguing findings to building mechanistic predictions. MI has discovered induction heads: attention heads that support in-context learning via a "find the previous occurrence of the current token and copy whatever came after it" circuit. But why do induction heads form? Under what conditions? At what scale? MI can measure their presence, but LM should explain their necessity. The LM perspective: induction heads are a low-complexity solution to a high-frequency pattern in natural language, and the Discretization Hypothesis predicts that SGD noise will discover and reinforce this solution preferentially. That's a testable prediction.
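The input-output behavior attributed to an induction head can be written as a short lookup rule. This sketch describes only the rule (given ... A B ... A, predict B), not how attention layers implement it, and `induction_prediction` is a hypothetical name.

```python
def induction_prediction(tokens):
    """Predict the next token with the induction rule on a non-empty sequence.

    Find the most recent earlier occurrence of the current (last) token and
    return the token that followed it; return None if there is no match.
    """
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_prediction(["A", "B", "C", "A"]))
```

The rule needs no knowledge of what the tokens mean, only whether they match. That is part of why it counts as a low-complexity solution to a pattern that recurs constantly in natural text.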

Conversely, MI gives LM phenomena to explain. MI has found that models trained on modular arithmetic learn to represent numbers as Fourier modes, use those modes to perform "clock arithmetic," and then grok — generalize suddenly after a long plateau of memorization. This is a concrete, measurable phenomenon with a clear structure. LM's job is to derive this from first principles: why Fourier modes? Why grokking (sudden generalization)? Why at that training step? The MI finding constrains and guides the theory in a way that scaling law experiments alone cannot.

Three levels of description

The LM ↔ MI relationship maps onto the three-level description hierarchy familiar from cognitive science (Marr's levels of analysis): computational (what does the system do?), algorithmic (what procedure does it use?), and implementation (how is it physically instantiated?). MI operates primarily at the algorithmic and implementation levels: it finds circuits and identifies the computations they perform. LM operates at the computational and algorithmic levels: it asks what computations gradient descent is trying to perform and why. Together, the two cover all three levels.

What would LM ask about induction heads? Not "do they exist?" (MI answered that) but: "At what critical width W* do induction heads first appear reliably? What loss term drives their formation — is it the next-token prediction loss on rare bigrams, or something else? Does the Discretization Hypothesis predict their discrete circuit structure? Can we derive their formation time from scaling laws on in-context learning performance?" These are LM-style questions that MI's experimental toolkit can answer.

Figure: Three-level diagram, Physics → Biology → Psychology alongside LM → MI → Capabilities. It shows the three levels of description in science and how Learning Mechanics and Mechanistic Interpretability map onto them.

From a Learning Mechanics perspective, what would be the right theoretical question to ask about induction heads (discovered by MI)?

(a) Under what conditions (width, data distribution, training duration) do induction heads necessarily form, and can the Discretization Hypothesis predict their discrete circuit structure from first principles?
(b) Whether induction heads are present in all transformer models
(c) Whether induction heads can be ablated without hurting performance

Chapter 8: Skepticism

A manifesto for a new scientific theory would be incomplete without confronting the strongest objections. The authors take this seriously — they dedicate a full section to four objections to Learning Mechanics, each representing a genuine concern that has been raised by serious researchers. The objections are not strawmen, and the responses are not dismissals. Understanding them sharpens the theory's claims.

Objection 1: Deep learning is too complex for theory. The argument: with billions of parameters, millions of training steps, and proprietary data, the system is too high-dimensional and heterogeneous for any tractable theory. Response: thermodynamics faced the same complexity objection (10²³ molecules), yet macroscopic laws emerge. The argument proves too much: it would also rule out theoretical ecology, macroeconomics, and climate science. The key is finding the right coarse-grained variables. The five lines of evidence in Chapters 2 through 6 suggest those variables exist for deep learning (sharpness, scaling exponents, feature similarity scores).

Objection 2: Empirical laws are curve-fitting, not understanding. The argument: neural scaling laws are power-law fits to data. Power laws appear everywhere in complex systems (earthquake sizes, city populations, word frequencies). Finding one in neural loss curves might be coincidence or data artifact, not a deep fact about learning. Response: the scaling laws are predictive over 5 orders of magnitude and across architectures. They predicted GPT-3's performance before it was trained. Ptolemy's epicycles fit planetary motion but couldn't predict new planets; Newton's laws could. Scaling laws are more Newton than Ptolemy, but the objection is valid — we need the mechanistic explanation of why β ≈ 0.07 for language.
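The difference between fitting a curve and making a prediction can be made concrete with a small sketch. It assumes the functional form L = αN^{-β} + L∞ used in this lesson, treats L∞ as known, and uses noise-free synthetic data. The constants α = 10 and L∞ = 1.5 are illustrative placeholders, not measured values, and the helper names are hypothetical.

```python
import math

def power_law_loss(n_params, alpha, beta, l_inf):
    """Loss under the form L = alpha * N**(-beta) + L_inf used in the text."""
    return alpha * n_params ** (-beta) + l_inf

def fit_power_law(ns, losses, l_inf):
    """Least-squares line fit in log-log space.

    log(L - L_inf) = log(alpha) - beta * log(N), so the slope is -beta and
    the intercept is log(alpha). Assumes L_inf is known and every L > L_inf.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(loss - l_inf) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope

# Illustrative constants (assumptions for this sketch, not measurements).
TRUE_ALPHA, TRUE_BETA, L_INF = 10.0, 0.07, 1.5

# Fit only on "small" models: 1e5 to 1e8 parameters.
small_ns = [10 ** k for k in range(5, 9)]
small_losses = [power_law_loss(n, TRUE_ALPHA, TRUE_BETA, L_INF) for n in small_ns]
alpha_hat, beta_hat = fit_power_law(small_ns, small_losses, L_INF)

# Extrapolate two orders of magnitude beyond the largest fitted model.
big_n = 10 ** 10
predicted = power_law_loss(big_n, alpha_hat, beta_hat, L_INF)
actual = power_law_loss(big_n, TRUE_ALPHA, TRUE_BETA, L_INF)
print(f"beta_hat={beta_hat:.4f}  predicted={predicted:.4f}  actual={actual:.4f}")
```

Because the synthetic data is generated from the same law, the extrapolation matches by construction. On real training runs the test is whether a held-out large model lands on the line fitted to the small ones; that is the sense in which a scaling law predicts rather than merely fits.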

Two more objections

Objection 3: Modern systems are too heterogeneous. Theory derived from simple settings (linear networks, infinite-width limits) may not extend to RLHF-finetuned LLMs with mixture-of-experts and custom attention masks. Response: this is a real concern. Learning Mechanics explicitly claims only partial coverage — the humble desideratum means acknowledging what the theory doesn't yet explain. The current theory covers pretraining dynamics well; finetuning, RLHF, and multi-modal training are active frontiers. This is progress, not failure.

Objection 4: A theory developed for today's architectures will be obsolete when architectures change. If state-space models replace transformers, or if some new training paradigm emerges, won't all the transformer-specific theory be wasted? Response: the theory's targets are architecture-independent where possible. Scaling laws, edge of stability, and µP hold across transformers and MLPs. Universality holds across architectures. The Discretization Hypothesis applies to any discrete-time optimizer. Architecture-specific findings (like transformer-specific phase transitions) are explicitly scoped as such.

Theory development is not linear: Caloric theory (the wrong theory of heat) preceded thermodynamics and was partially useful before being discarded. Phlogiston theory preceded oxygen chemistry. The history of science shows that wrong theories with partial predictive success are essential stepping stones — they make predictions that force experiments that reveal their limits and point toward the right theory. Learning Mechanics doesn't need to be final to be valuable; it needs to be falsifiable and more useful than the current folklore.

Which historical analogy best describes the role of an imperfect but partially predictive theory like Learning Mechanics?

Caloric theory — partially useful and predictive, eventually superseded, but an essential stepping stone that forced the experiments revealing thermodynamics
Ptolemy's epicycles — complex curve-fitting with no mechanistic insight
Newton's laws — complete and final for its domain

Chapter 9: Connections

Learning Mechanics is not one paper — it is a convergent research program across many groups. The five evidence lines (solvable settings, simplifying limits, empirical laws, hyperparameter transfer, universality) each represent years of independent work that the paper synthesizes into a coherent framework. Understanding the program means knowing the key papers and how they connect, and knowing what remains open.

Key open directions

The paper identifies ten open directions for Learning Mechanics. Each represents a major gap between current theory and practice:

Mechanistic scaling laws: Derive the scaling exponent β from first principles, not just measure it. Why β ≈ 0.07 for language and ≈ 0.11 for images?
Finetuning theory: RLHF, LoRA, and instruction finetuning are poorly understood theoretically. How does LM's solvable-settings approach extend to finetuning dynamics?
Data curation theory: Filtering and deduplication dramatically affect scaling law constants. What determines the data efficiency coefficient α in L = αN^{-β} + L∞?
Emergence theory: Capabilities like in-context learning and chain-of-thought reasoning appear suddenly at certain scales. Are these real phase transitions or measurement artifacts?
Multimodal universality: Does the Platonic Representation Hypothesis extend to multimodal models? Do vision-language models converge to a single world model faster than unimodal ones?
Adversarial robustness: Adversarial examples are a systematic failure of generalization. Can LM's mechanistic framework explain why they exist and predict which models are robust?
Continual learning: Catastrophic forgetting lacks a mechanistic explanation. Is it a phase transition? Does it connect to the edge of stability?
Architecture comparison: SSMs vs transformers have different inductive biases. Can LM predict which architecture is optimal for which data distribution?
Post-training alignment: RLHF changes more than just the output distribution — it changes internal representations. What does LM predict about how alignment fine-tuning changes the learned world model?
Causal intervention theory: When an MI experiment ablates a circuit, what does LM predict about the effect on other circuits? Can LM give a principled account of circuit independence?

Cheat sheet — five evidence lines:
1. Solvable settings — deep linear nets (greedy SVD), NTK (lazy regime exact solution)
2. Simplifying limits — lazy vs rich (α parameter, distinct from the scaling-law coefficient α), Discretization Hypothesis (SGD → discrete structure)
3. Empirical laws — scaling laws (L = αN^{-β} + L∞), edge of stability (λ_max → 2/η), neural collapse
4. Hyperparameter transfer — µP (width-independent optimal LR), implicit curvature regularization
5. Universality — cross-architecture rep similarity, Platonic Representation, data power laws
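The 2/η in the edge-of-stability entry comes from a classical stability condition that is easy to check directly. On a one-dimensional quadratic f(x) = ½λx², each gradient descent step multiplies x by (1 − ηλ), so the iterates shrink only when λ < 2/η. The sketch below illustrates just that threshold; it does not reproduce the empirical phenomenon, in which the sharpness of a real network rises toward 2/η and hovers there. The function name and the learning rate are hypothetical choices for illustration.

```python
def run_gd(sharpness, lr, steps=50, x0=1.0):
    """Gradient descent on f(x) = 0.5 * sharpness * x**2; returns final |x|."""
    x = x0
    for _ in range(steps):
        grad = sharpness * x      # f'(x) = sharpness * x
        x = x - lr * grad         # x is multiplied by (1 - lr * sharpness)
    return abs(x)

lr = 0.1
threshold = 2.0 / lr  # stability boundary for this learning rate

below = run_gd(sharpness=0.9 * threshold, lr=lr)  # multiplier 1 - 1.8 = -0.8
above = run_gd(sharpness=1.1 * threshold, lr=lr)  # multiplier 1 - 2.2 = -1.2

print(f"threshold={threshold}  below={below:.3e}  above={above:.3e}")
```

With curvature just under the threshold the iterate oscillates in sign but decays; just over it, the same oscillation grows without bound.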

Key papers

Paper	Year	Contribution	Evidence Line
Saxe, McClelland, Ganguli	2014	Deep linear network exact dynamics — sequential SV learning	Solvable settings
Jacot, Gabriel, Hongler	2018	Neural Tangent Kernel — infinite-width exact solution	Solvable settings
Kaplan et al.	2020	Neural scaling laws — L = αN^{-β} + L∞ over 5 orders of magnitude	Empirical laws
Cohen et al.	2021	Edge of Stability — sharpness tracks 2/η during training	Empirical laws
Yang et al. (µP)	2022	Maximal Update Parameterization — width-independent hyperparameter transfer	Hyperparameter transfer
Huh, Cheung, Wang, Isola	2024	Platonic Representation Hypothesis — cross-architecture convergence	Universality
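The "sequential SV learning" entry in the first row can be illustrated with a toy model. This is a minimal sketch, assuming the idealized setting in which the singular modes of a two-layer linear network decouple, so that each mode reduces to a scalar pair (a, b) whose product is trained toward the mode's strength. The function name, step size, and initialization scale are hypothetical.

```python
def steps_to_learn(strength, lr=0.01, init=0.01, fraction=0.5, max_steps=100_000):
    """Gradient descent on 0.5 * (strength - a*b)**2 from a small balanced init.

    Returns the number of steps until the product a*b first reaches
    `fraction` of `strength`. One scalar pair (a, b) stands in for one
    singular mode of a two-layer linear network.
    """
    a = b = init
    for step in range(max_steps):
        if a * b >= fraction * strength:
            return step
        err = strength - a * b
        # Simultaneous update: d/da = -err * b, d/db = -err * a.
        a, b = a + lr * err * b, b + lr * err * a
    return max_steps

strong = steps_to_learn(strength=3.0)
weak = steps_to_learn(strength=1.0)
print(f"strong mode: {strong} steps, weak mode: {weak} steps")
```

Starting from the same small initialization, the stronger mode crosses the halfway point in fewer steps than the weaker one, which is the ordering the table entry describes: larger singular values are learned first.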

Learning Mechanics Concept Map

The five evidence lines and how they connect to each other and to practice. Lines show theoretical dependencies.

According to the paper, why are the five evidence lines for Learning Mechanics stronger together than any one of them individually?

Each line uses different methods, datasets, and scales — their convergence on consistent descriptions of deep learning dynamics rules out coincidence and data-specific artifacts
More evidence always means more confidence, regardless of whether the evidence is independent
The five lines were designed to cover different architectures, so together they are exhaustive