Loss follows smooth power laws across seven orders of magnitude in compute, data, and parameters. This paper proved that scaling is predictable — and launched the era of training ever-larger models.
It is January 2020. GPT-2 has just shown that language models can write coherent paragraphs. Labs everywhere are asking the same question: should we build a bigger model?
Bigger means more parameters, more data, more compute — more money. A 1.5-billion parameter model already costs serious GPU time. Going to 10 billion or 100 billion is a bet worth millions of dollars. Nobody wants to make that bet blind.
The stakes were real. Training GPT-2 (1.5 billion parameters) required weeks on high-end hardware. Going 10x bigger meant 10x the cost — and you might get nothing useful if the architecture hit a ceiling or the data ran out. The field had no theory for whether bigger models would keep getting better, or when they would plateau.
Jared Kaplan, Sam McCandlish, and their colleagues at OpenAI decided to answer the question empirically. They trained hundreds of Transformer language models, from tiny 768-parameter networks to 1.5-billion-parameter giants, and carefully measured how test loss (cross-entropy, in nats per token) changed as they varied three things: the number of model parameters N, the dataset size D in tokens, and the training compute C.
What they found stunned the field. The relationship between loss and each of these three variables is not chaotic, not complicated, not dependent on clever tricks. It is a smooth power law — a straight line on a log-log plot — spanning more than seven orders of magnitude.
This paper did not just describe a curious empirical regularity. It gave the field a planning tool. For the first time, you could predict how well a model would perform before spending a cent on training. You could ask: "Given my GPU budget, how big should my model be? How much data do I need?" And get a quantitative answer.
The answers were surprising. The paper argued that you should train very large models on relatively modest amounts of data, and stop well before convergence. This was the opposite of conventional wisdom, which said you should train smaller models for longer. Two years later, DeepMind's Chinchilla paper would partially revise this conclusion — but the core framework of predictable power-law scaling remains the foundation of how we think about scale today.
A note on cross-entropy loss, since we'll use it throughout. Cross-entropy measures how surprised your model is by the next token. A loss of 3.5 nats/token means the model assigns, on average, $e^{-3.5} \approx 3\%$ probability to the correct next token. A loss of 2.5 nats/token means $e^{-2.5} \approx 8\%$, which is much better. Lower loss means the model is less surprised, which means it understands language better.
The unit is nats (natural units of information), not bits. One nat = 1/ln(2) ≈ 1.44 bits. The paper uses nats throughout because they arise naturally from the natural logarithm in cross-entropy.
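The conversions above are easy to check numerically. The sketch below is only illustrative: the function names are invented for this example, and "typical probability" means the geometric-mean probability implied by an average loss.

```python
import math


def nats_to_bits(loss_nats: float) -> float:
    """Convert a cross-entropy loss from nats to bits (1 nat = 1/ln 2 bits)."""
    return loss_nats / math.log(2)


def typical_token_probability(loss_nats: float) -> float:
    """Geometric-mean probability given to the correct token at this average loss."""
    return math.exp(-loss_nats)


if __name__ == "__main__":
    for loss in (3.5, 2.5):
        print(
            f"{loss} nats/token = {nats_to_bits(loss):.2f} bits/token, "
            f"typical p(correct) = {typical_token_probability(loss):.3f}"
        )
```

A one-nat drop in loss multiplies the typical probability of the correct token by e, about 2.7, which is why differences that look small in nats are large in practice.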
Let's understand exactly how these scaling laws work, what they mean, and why they changed everything.
Before we can understand the scaling laws, we need to understand the mathematical object at their heart: the power law.
A power law is a relationship of the form:

$$y = a \cdot x^{\alpha}$$

where a is a constant prefactor and α (alpha) is a constant exponent. If α is negative, then y decreases as x grows, which is exactly what we want for loss decreasing as we add more parameters.
The magical property of a power law is what happens when you take logarithms of both sides:

$$\log y = \log a + \alpha \log x$$
This is the equation of a straight line, with slope α. So if you plot y versus x on log-log axes (both axes use logarithmic scale), a power law becomes a perfectly straight line. The slope of that line is the exponent.
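To see the straight-line property concretely, one can generate points from a power law and fit a line to their logarithms. This is a minimal sketch using an ordinary least-squares fit written out by hand. The prefactor 3.0 and the sample points are arbitrary illustrative choices, not values from the paper.

```python
import math


def loglog_fit(xs, ys):
    """Least-squares line through (log x, log y); returns (slope, intercept)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic power law y = a * x**alpha (illustrative values only).
a, alpha = 3.0, -0.076
xs = [10.0 ** k for k in range(3, 10)]
ys = [a * x ** alpha for x in xs]

slope, intercept = loglog_fit(xs, ys)
fitted_prefactor = math.exp(intercept)
print(f"fitted exponent = {slope:.4f}, fitted prefactor = {fitted_prefactor:.4f}")
```

The fitted slope recovers the exponent and the exponentiated intercept recovers the prefactor. On real measurements the points scatter around the line, and the quality of the fit is what tells you whether a power law is a reasonable description.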
Power laws are everywhere in nature. City populations, earthquake magnitudes, word frequencies (Zipf's law), and now neural network performance all follow them. They typically arise when the underlying system has self-similarity across scales — the structure at one scale looks like a rescaled version of the structure at another.
Contrast this with exponential relationships like $y = a \cdot e^{bx}$. On a log-linear plot (log y vs. x), exponentials are straight lines. On a log-log plot, they curve. Power laws and exponentials look completely different, and they behave differently at scale. Exponentials either explode or decay to zero rapidly. Power laws are gentler: they always give diminishing returns, never hitting a wall, never leveling off to a hard plateau. This is exactly what Kaplan et al. observed for neural network loss.
Why might neural scaling follow power laws rather than exponentials? The paper doesn't fully answer this, but one intuition comes from information theory. Natural language has structure at many scales — character patterns, word collocations, syntactic rules, semantic relationships, discourse structure, world knowledge. A model with N parameters can capture structure up to some complexity threshold. Each order of magnitude in N gives access to a new "layer" of linguistic structure, yielding a roughly constant fractional improvement in loss. This multiplicative structure at each scale is the hallmark of a power law.
For this paper, the specific form is:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$

where L is the test loss (cross-entropy in nats), N is the number of non-embedding parameters, $N_c$ is a constant that sets the scale, and $\alpha_N \approx 0.076$ is the power-law exponent. That exponent tells you: every time you double N, the loss shrinks by a factor of $2^{-0.076} \approx 0.949$, about a 5% reduction.
Five percent might sound small. But it compounds. Going from 1 million to 1 billion parameters (a factor of 1000) reduces loss by a factor of $1000^{0.076} \approx 1.7$. And because cross-entropy loss operates in log-probability space, even small reductions can mean dramatically better text generation.
One PF-day, by the way, is $10^{15} \times 86{,}400 = 8.64 \times 10^{19}$ floating-point operations. A modern GPU (A100) does about $3 \times 10^{14}$ FLOPS, so one PF-day is roughly 3 days on a single A100. The paper spans from about $10^{-8}$ PF-days (seconds of training) to $10^{2}$ PF-days (months on a GPU cluster).
The paper identifies three independent scaling laws, one for each "knob" you can turn. Each is a power law. Each produces a straight line on a log-log plot. And each holds when the other two knobs are turned up high enough not to be the bottleneck.
But first, a note on the experimental setup. All models are decoder-only Transformers trained on WebText2 — an extended version of the dataset used for GPT-2, containing about 23 billion tokens scraped from Reddit outbound links with at least 3 karma (a heuristic for "interesting content"). They used the Adam optimizer with a cosine learning rate schedule, 1024-token contexts, and byte-pair encoding with a vocabulary of 50,257 tokens. The crucial decision to count only non-embedding parameters — excluding the token embedding and positional embedding matrices — produced much cleaner scaling laws.
Compute is estimated as C ≈ 6NBS floating-point operations, where B is the batch size in tokens and S is the number of training steps. The factor 6 accounts for the forward pass (2N FLOPs per token) and backward pass (roughly 4N FLOPs per token). One PF-day = $8.64 \times 10^{19}$ FLOPs.
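The compute estimate is simple enough to script. The sketch below assumes the batch size is measured in tokens. The run configuration (512 sequences of 1024 tokens for 250,000 steps) is a hypothetical example chosen for illustration, not a figure quoted in this article.

```python
PF_DAY_FLOPS = 1e15 * 86_400  # one petaFLOP/s sustained for a day = 8.64e19 FLOPs


def training_flops(n_params: float, batch_tokens: float, steps: float) -> float:
    """C ~ 6 * N * B * S: about 2N forward plus 4N backward FLOPs per token."""
    return 6.0 * n_params * batch_tokens * steps


def to_pf_days(flops: float) -> float:
    """Express a FLOP count in petaflop/s-days."""
    return flops / PF_DAY_FLOPS


# Hypothetical run (illustrative numbers only).
n_params = 1.5e9            # non-embedding parameters
batch_tokens = 512 * 1024   # 512 sequences of 1024 tokens each
steps = 250_000

flops = training_flops(n_params, batch_tokens, steps)
print(f"{flops:.3e} FLOPs = {to_pf_days(flops):.1f} PF-days")
```

Under these assumptions the run comes to roughly 14 PF-days, which places a 1.5B-parameter model near the top of the compute range the paper explored.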
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$

with $\alpha_N \approx 0.076$ and $N_c \approx 8.8 \times 10^{13}$. This holds when you train to convergence on a sufficiently large dataset. It says: make the model bigger, loss goes down, predictably. The exponent 0.076 means you get diminishing returns: you need roughly 10x more parameters for each 16% reduction in loss ($10^{-0.076} \approx 0.84$).
$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$

with $\alpha_D \approx 0.095$ and $D_c \approx 5.4 \times 10^{13}$ tokens. This holds for a sufficiently large model trained with early stopping. More data helps, and the returns are slightly better than for parameters ($\alpha_D > \alpha_N$).
To isolate this law, the paper trained a single large model ($n_{layer} = 36$, $d_{model} = 1280$) on fixed subsets of WebText2, ranging from 22 million to 22 billion tokens. They stopped training once the test loss ceased to decrease. The resulting losses fell on a clean power law in D. The constant $D_c \approx 5.4 \times 10^{13}$ is enormous: roughly 54 trillion tokens. It is the dataset size at which the fitted law would reach a loss of 1 nat/token (a power law never predicts zero loss at any finite D), and we are very far from that point. Every dataset we currently use is "small" by the standards of this power law.
$$L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}$$

with $\alpha_C^{\min} \approx 0.050$ and $C_c^{\min} \approx 3.1 \times 10^{8}$ PF-days. This is the most important law for practitioners, because compute is usually the binding constraint. It says: given a fixed compute budget, with optimal allocation between model size and training duration, loss scales as a power law in total compute.
The subscript "min" on $C_{\min}$ is important. It refers to the minimum compute needed to reach a given loss, which is what you'd use if training at the optimal (usually small) batch size. In practice, researchers often train at larger batch sizes for speed, using more total compute than $C_{\min}$. The paper carefully adjusts for this by defining $C_{\min}(C) = C / (1 + B/B_{crit})$, which corrects for the inefficiency of training at batch sizes above the critical threshold.
Notice the exponents are all small: 0.076, 0.095, 0.050. This means the returns to scale are always diminishing. You never get linear returns. But you always get returns. There is no wall, no plateau, no point of "good enough." As long as you can afford to scale, performance improves — smoothly and predictably.
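The three single-variable laws can be written as one-line functions. This sketch plugs in the exponents and scale constants quoted above. It is meant for building intuition, and each function is only meaningful when the other two resources are not the bottleneck.

```python
# Fitted constants as quoted in this article.
ALPHA_N, N_C = 0.076, 8.8e13   # non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # tokens
ALPHA_C, C_C = 0.050, 3.1e8    # PF-days (compute-efficient frontier)


def loss_from_params(n_params: float) -> float:
    """L(N) = (N_c / N) ** alpha_N, in nats per token."""
    return (N_C / n_params) ** ALPHA_N


def loss_from_data(n_tokens: float) -> float:
    """L(D) = (D_c / D) ** alpha_D, in nats per token."""
    return (D_C / n_tokens) ** ALPHA_D


def loss_from_compute(c_min_pf_days: float) -> float:
    """L(C_min) = (C_c / C_min) ** alpha_C, in nats per token."""
    return (C_C / c_min_pf_days) ** ALPHA_C


if __name__ == "__main__":
    print(f"L(N = 1.5e9 params) = {loss_from_params(1.5e9):.2f}")
    print(f"L(D = 2.2e10 tokens) = {loss_from_data(2.2e10):.2f}")
    print(f"L(C_min = 1 PF-day)  = {loss_from_compute(1.0):.2f}")
```

Each scale constant is the value of the resource at which the fitted law would hit exactly 1 nat/token, which is a handy way to read them.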
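Let's build intuition for what these exponents feel like. With $\alpha_N = 0.076$, a 10x increase in N multiplies the loss by $10^{-0.076} \approx 0.84$ (a 16% reduction), a 100x increase by about 0.70, and a 1000x increase by about 0.59.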
Each factor of 10 in parameters buys you the same percentage reduction. This is the essence of a power law: constant returns on a logarithmic scale. Going from 1M to 10M parameters gives the same percentage improvement as going from 1B to 10B. The absolute improvement in loss is larger for the smaller jump, but the relative improvement is identical.
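The difference between relative and absolute improvement is easy to verify. The sketch below compares the jump from 1M to 10M parameters with the jump from 1B to 10B, using the L(N) constants quoted earlier; the helper names are made up for this example.

```python
ALPHA_N, N_C = 0.076, 8.8e13  # constants of the L(N) law quoted in this article


def loss(n_params: float) -> float:
    """L(N) = (N_c / N) ** alpha_N, in nats per token."""
    return (N_C / n_params) ** ALPHA_N


def compare_jump(n_start: float, factor: float = 10.0) -> dict:
    """Relative and absolute loss reduction when N grows by `factor`."""
    before, after = loss(n_start), loss(n_start * factor)
    return {
        "relative_drop": 1.0 - after / before,
        "absolute_drop": before - after,
    }


small = compare_jump(1e6)   # 1M -> 10M parameters
large = compare_jump(1e9)   # 1B -> 10B parameters

for name, result in (("1M -> 10M", small), ("1B -> 10B", large)):
    print(
        f"{name}: relative drop {result['relative_drop']:.1%}, "
        f"absolute drop {result['absolute_drop']:.2f} nats"
    )
```

Both jumps give the same relative drop of about 16%, while the absolute drop in nats is larger for the 1M to 10M jump because it starts from a higher loss.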
The compute exponent $\alpha_C^{\min} \approx 0.050$ is the smallest, which makes sense. Compute is "split" between parameters and training time, so its effective power is diluted. The paper derives a theoretical relationship, $\alpha_C^{\min} = 1/(1/\alpha_S + 1/\alpha_B + 1/\alpha_N)$, which predicts 0.054, matching the empirical 0.050 within measurement error.
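The combination rule is a harmonic-style sum, so the smallest exponent dominates. The sketch below evaluates it with the rounded exponents used in this article (0.76 for training steps, 0.21 for batch size, 0.076 for parameters). With those rounded inputs the result comes out near 0.052 rather than exactly the paper's quoted 0.054, presumably because the paper used unrounded fit values; treat that explanation as an assumption.

```python
def combined_compute_exponent(alpha_s: float, alpha_b: float, alpha_n: float) -> float:
    """alpha_C_min = 1 / (1/alpha_S + 1/alpha_B + 1/alpha_N)."""
    return 1.0 / (1.0 / alpha_s + 1.0 / alpha_b + 1.0 / alpha_n)


# Rounded exponents as quoted in this article.
alpha_c = combined_compute_exponent(alpha_s=0.76, alpha_b=0.21, alpha_n=0.076)
print(f"predicted compute exponent = {alpha_c:.3f}")
```

Because the reciprocals add, the result is always smaller than the smallest input exponent, which is the quantitative version of compute being "split" between several resources.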
The paper verified these trends across six orders of magnitude in N (768 to 1.5 billion parameters), about three orders of magnitude in D (22 million to 22 billion tokens), and eight orders of magnitude in compute. At no point did the straight lines on the log-log plots show any sign of bending.
Here is one of the paper's most surprising findings. You might expect that the architecture of a Transformer — how deep it is, how wide, how many attention heads — would strongly affect performance. After all, researchers spend enormous effort tuning these hyperparameters.
Kaplan et al. found the opposite. When they held the total number of non-embedding parameters N fixed and varied the shape — depth versus width, feed-forward ratio, number of attention heads — the loss changed by only a few percent.
This is enormously consequential. It means that when budgeting for performance, you should think in terms of total parameter count, not architectural details. The knob that matters is "how big," not "what shape."
The paper tested this systematically. They varied the feed-forward ratio (dff / dmodel, normally 4), the aspect ratio (dmodel / nlayer), and the attention head dimension (dmodel / nheads), each time adjusting other dimensions to keep N constant at either 25M or 50M parameters. The loss varied by at most a few percent across the entire range.
There are some caveats. Models with fewer than 2 layers perform noticeably worse, and extreme depth-to-width ratios (very narrow, very deep) degrade slightly. But within a wide "reasonable" range, shape is approximately irrelevant.
Why might this be? One speculation, noted in the paper, is that deeper Transformers may behave as ensembles of shallower models, similar to what has been observed for ResNets. If deeper models are roughly equivalent to ensembles of shallower models, then swapping depth for width (at constant total parameters) doesn't fundamentally change what the model can represent — it just reorganizes the same capacity.
One important detail: this result only holds when you count non-embedding parameters. The embedding matrix (which maps tokens to vectors) and positional embeddings are excluded. When embedding parameters are included, performance appears to depend on depth — but this is an artifact. The embedding matrix has nvocab × dmodel parameters, which scales linearly with dmodel. For a shallow, wide model (large dmodel), the embedding matrix dominates the total parameter count, inflating it without contributing proportionally to modeling capacity. Excluding it reveals the clean scaling law.
This is a methodological lesson that reverberates through all subsequent scaling work: choose your metrics carefully. Counting the "wrong" parameters (including embeddings) would have obscured the very regularity this paper discovered. The authors tried both and noticed the cleaner fit. Sometimes the most important scientific decision is what to put on your axes.
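$$N \approx 2\, d_{model}\, n_{layer}\, (2\, d_{attn} + d_{ff}) = 12\, n_{layer}\, d_{model}^2$$

This formula counts the non-embedding parameters in a standard Transformer, where $d_{attn} = d_{ff}/4 = d_{model}$. The factor 12 comes from the attention (Q, K, V, output projection) and feed-forward (two linear layers) components, which contribute $4\, d_{model}^2$ and $8\, d_{model}^2$ respectively. Each layer contributes $12\, d_{model}^2$ parameters.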
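The parameter count can be sketched in a few lines. The general form below (attention plus feed-forward weights, ignoring biases and layer norms) and the embedding count of (n_vocab + n_ctx) × d_model are assumptions based on the standard Transformer layout. The (48, 1600) shape is included as a GPT-2-XL-like illustration, not a configuration quoted in this article.

```python
def non_embedding_params(n_layer: int, d_model: int, d_attn=None, d_ff=None) -> int:
    """Approximate non-embedding parameter count: 2 * d_model * n_layer * (2*d_attn + d_ff).

    Defaults follow the standard shape d_attn = d_model and d_ff = 4 * d_model,
    which reduces the count to 12 * n_layer * d_model**2.
    """
    d_attn = d_model if d_attn is None else d_attn
    d_ff = 4 * d_model if d_ff is None else d_ff
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)


def embedding_params(n_vocab: int, n_ctx: int, d_model: int) -> int:
    """Token plus positional embedding parameters (excluded from N)."""
    return (n_vocab + n_ctx) * d_model


if __name__ == "__main__":
    for n_layer, d_model in ((36, 1280), (48, 1600), (2, 1280)):
        n = non_embedding_params(n_layer, d_model)
        emb = embedding_params(50_257, 1024, d_model)
        share = emb / (n + emb)
        print(f"n_layer={n_layer}, d_model={d_model}: N={n:,}, embeddings={emb:,} ({share:.0%} of total)")
```

The shallow, wide configuration in the last row shows why embeddings distort the picture: with only two layers, the embedding matrices make up a large share of the total even though they add little modeling capacity.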
The practical upshot: if you are designing a model and have a target parameter count, you have wide latitude in choosing the aspect ratio. Pick whatever is most convenient for your hardware (wide models parallelize better across GPUs; deep models use less memory per layer).
The paper also compared Transformers to LSTMs. Both architectures follow power laws, but with a critical difference: Transformers outperform LSTMs on late tokens in the context. An LSTM's performance plateaus after about 100 tokens of context — it cannot effectively use information from early in a long sequence. Transformers improve continuously through the full 1024-token context. This is a consequence of the attention mechanism providing O(1) access to all positions, versus the LSTM's sequential information bottleneck.
When do you start wasting parameters? When your model is too big for your data, it memorizes instead of generalizing. This is overfitting, and it is the bane of machine learning. Every practitioner has experienced it: training loss keeps dropping, but test loss starts climbing. The model is fitting noise in the training data rather than learning generalizable patterns.
The conventional wisdom is that overfitting is unpredictable — you just have to watch the validation curve and stop when it turns. But Kaplan et al. showed that overfitting, too, follows a predictable law. You can calculate in advance exactly how much data you need for a given model size to keep overfitting below any threshold.
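The unified equation for loss as a function of both model size N and dataset size D is:

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$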
This single equation elegantly encodes both individual scaling laws. Set D to infinity: the $D_c/D$ term vanishes, and you recover $L(N) = (N_c/N)^{\alpha_N}$. Set N to infinity: the first term vanishes, and you recover $L(D) = (D_c/D)^{\alpha_D}$. In between, the two terms compete: whichever is larger (model too small, or data too small) dominates the loss.
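The two limits can be checked directly by passing infinity for one argument. This sketch assumes the combined form $L(N,D) = [(N_c/N)^{\alpha_N/\alpha_D} + D_c/D]^{\alpha_D}$ and, for simplicity, reuses the single-variable constants quoted earlier in this article. The paper fits the combined equation separately, so its constants differ somewhat; the defaults here are an illustrative assumption.

```python
import math


def loss_nd(
    n_params: float,
    n_tokens: float,
    alpha_n: float = 0.076,
    alpha_d: float = 0.095,
    n_c: float = 8.8e13,
    d_c: float = 5.4e13,
) -> float:
    """Combined loss L(N, D) in nats per token."""
    model_term = (n_c / n_params) ** (alpha_n / alpha_d)
    data_term = d_c / n_tokens
    return (model_term + data_term) ** alpha_d


if __name__ == "__main__":
    n = 1e9
    for d in (1e8, 1e9, 1e10, 1e11, math.inf):
        print(f"N=1e9, D={d:.0e}: L = {loss_nd(n, d):.3f}")
```

As D grows with N fixed, the loss falls toward the floor set by model size and then stops improving, which is the data-limited to model-limited transition described above.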
The paper derived this functional form from three principles. First, changes in tokenization should rescale the loss by a constant factor, which this form allows. Second, fixing either N or D and sending the other to infinity should recover the single-variable power laws. Third, the loss should be analytic at D = ∞, meaning it has a series expansion in powers of 1/D. This third principle is the most speculative, but it explains the asymmetric roles of N and D in the equation.
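The degree of overfitting, defined as $\delta L = L(N,D)/L(N,\infty) - 1$ (the fractional excess loss compared to training with infinite data), depends on a specific ratio:

$$\frac{N^{\alpha_N/\alpha_D}}{D} \approx \frac{N^{0.74}}{D}$$

When this ratio is small, overfitting is negligible. When it grows, the model has more capacity than the data can fill, and it starts memorizing. The exponent $0.74 = \alpha_N/\alpha_D$ arises directly from the combined L(N,D) equation, using the exponents from the joint fit ($\alpha_N \approx 0.076$, $\alpha_D \approx 0.103$; the joint fit yields a slightly different $\alpha_D$ than the single-variable value of 0.095). It has a beautifully practical meaning: to hold overfitting constant, the dataset only needs to grow as $N^{0.74}$, so a 10x larger model needs only about $10^{0.74} \approx 5.5$ times more data.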
This sub-linear scaling was surprising and has important economic implications. Naive intuition says "bigger model, more parameters to fit, need proportionally more data." But larger Transformers extract more information per data point. They learn faster: a 1-billion-parameter model reaches a given loss after seeing far fewer tokens than a 10-million-parameter model needs, if the smaller model can reach that loss at all.
Think about it from an information-theoretic perspective. A larger model has more "buckets" to sort information into. When it sees a training example, it can extract finer-grained patterns from it. A small model might learn "this is an English sentence" from a data point. A large model might additionally learn the subtle syntactic structure, the semantic relationships between words, and the pragmatic context — all from the same example. More parameters means more information per data point means sub-linear data requirements.
The paper uses a 10% dropout rate for all models in these experiments. They acknowledge that optimizing regularization (varying dropout rate with model and dataset size) might change the quantitative results, but expect the qualitative picture — sub-linear data scaling — to hold. This is an important caveat: the specific exponent 0.74 is measured under one particular training recipe, and might shift slightly under different regularization choices.
Let's make this concrete. With equation (4.4) from the paper, to avoid overfitting you need:

$$D \gtrsim (5 \times 10^{3})\, N^{0.74}$$

For a 1B parameter model: $D \gtrsim 5000 \times (10^{9})^{0.74} \approx 5000 \times 4.6 \times 10^{6} \approx 2.3 \times 10^{10}$ tokens (about 23 billion tokens). The WebText2 training set had 22 billion tokens, so a 1B-parameter model sits right at the edge: smaller models train without meaningful overfitting, while models at or above this size begin to overfit mildly. A 10B-parameter model would need $D \gtrsim 5000 \times (10^{10})^{0.74} \approx 1.3 \times 10^{11}$ tokens (about 126 billion), far beyond the training set.
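The data requirement is a one-line formula, so it is worth evaluating rather than estimating by hand. This sketch uses the rule $D \gtrsim 5000 \cdot N^{0.74}$ with N counted in non-embedding parameters and D in tokens; the function name is invented for this example.

```python
def min_tokens_to_avoid_overfitting(
    n_params: float, prefactor: float = 5e3, exponent: float = 0.74
) -> float:
    """Approximate dataset size (tokens) needed to keep overfitting small."""
    return prefactor * n_params ** exponent


if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10, 1e11):
        d = min_tokens_to_avoid_overfitting(n)
        print(f"N = {n:.0e} params -> D >= {d:.2e} tokens ({d / n:.1f} tokens per parameter)")
```

Evaluating it gives roughly 2.3 × 10^10 tokens for a 1B-parameter model and 1.3 × 10^11 for a 10B-parameter model. The tokens-per-parameter column falls as N grows, which is the sub-linear scaling in action.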
The paper fit this combined equation to experiments spanning from 22 million to 22 billion tokens of training data and from tiny to billion-parameter models. The fits were excellent, except for extremely small datasets (~20 million tokens), which may represent a qualitatively different regime where overfitting happens within the first few training steps.
An important detail: the paper also found that transfer loss — performance on text distributions different from the training set (books, Wikipedia, Common Crawl) — scales in parallel with the training distribution. There is a constant offset (the new distribution is harder or easier), but the slope is the same. This means scaling laws predict out-of-distribution performance too, not just performance on the training set. Generalization improves predictably with scale.
Watch a neural network train. Loss starts high and drops rapidly at first, then slows. This learning curve has a characteristic shape. What Kaplan et al. discovered is that this shape is essentially the same for all model sizes.
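After an initial transient period (the first few thousand steps, where the learning rate is warming up), all learning curves can be fit by:

$$L(N, S_{\min}) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{S_c}{S_{\min}}\right)^{\alpha_S}$$

where $S_{\min}$ is the number of training steps (adjusted to the optimal batch size), $S_c \approx 2.1 \times 10^{3}$, and $\alpha_S \approx 0.76$.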
This equation has two additive terms, and the additive structure is key to understanding it.
The first term, $(N_c/N)^{\alpha_N}$, is the irreducible loss for model size N: the best you could ever achieve by training this model forever on infinite data. It represents the model's fundamental capacity limit. A 10M-parameter model simply cannot represent certain patterns in language, no matter how long you train it.
The second term, $(S_c/S_{\min})^{\alpha_S}$, is the training deficit: how far you are from that floor because you haven't trained long enough. Early in training, this term dominates. Late in training, it shrinks and the model's capacity limit takes over.
The additive form means the two sources of error are independent. You can be limited by model size, limited by training time, or (in the worst case) limited by both. The optimal allocation of compute tries to balance these two terms so neither dominates excessively.
The universality runs deep. The training-time exponent αS ≈ 0.76 is the same for all model sizes. The constant Sc is the same for all model sizes. Only the irreducible loss floor changes with N. This means all models follow parallel tracks on a log-loss vs. log-steps plot — larger models start lower and stay lower, but the rate of improvement per step is universal.
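The additive structure makes the learning curve easy to explore in code. This sketch assumes the form L(N, S_min) = (N_c/N)^α_N + (S_c/S_min)^α_S with the step constants quoted above, and it reuses the single-variable N_c and α_N for the capacity term, which is a simplifying assumption (the paper fits this equation separately).

```python
ALPHA_N, N_C = 0.076, 8.8e13   # capacity term (single-variable fit, assumed here)
ALPHA_S, S_C = 0.76, 2.1e3     # training-time term


def capacity_floor(n_params: float) -> float:
    """Loss floor for model size N after unlimited training."""
    return (N_C / n_params) ** ALPHA_N


def training_deficit(s_min: float) -> float:
    """Extra loss from having trained for only s_min (batch-adjusted) steps."""
    return (S_C / s_min) ** ALPHA_S


def loss(n_params: float, s_min: float) -> float:
    return capacity_floor(n_params) + training_deficit(s_min)


def steps_for_deficit(target_deficit: float) -> float:
    """Steps needed to bring the training deficit down to target_deficit nats."""
    return S_C * target_deficit ** (-1.0 / ALPHA_S)


if __name__ == "__main__":
    for n in (1e7, 1e9):
        curve = ", ".join(f"{loss(n, s):.2f}" for s in (1e3, 1e4, 1e5))
        print(f"N={n:.0e}: loss at 1e3/1e4/1e5 steps = {curve}")
    print(f"steps for a 0.1-nat deficit: {steps_for_deficit(0.1):,.0f}")
```

Because the deficit term does not depend on N, the number of steps needed to get within a given distance of the floor is the same for every model size under this fit.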
The paper speculates that this universality reflects something about the loss landscape of Transformers. Since the power-law fits are best late in training (when the loss surface may be approximately quadratic), the exponent αS likely encodes information about the spectrum of the Hessian matrix — specifically, the density of eigenvalues. The fact that this spectrum appears to be roughly independent of model size is a deep and unexplained empirical regularity.
A practical consequence: if you are running a hyperparameter search, you can train many small models for short runs, measure their learning curves, and use the universal learning curve equation to predict which configuration would perform best at full scale. This is a massive cost savings — you replace one expensive run with many cheap ones.
What about early stopping? The paper shows you can predict when to stop training. If you know L(N, D) — the loss you'll achieve with model size N and dataset D — and L(N, ∞) — the loss with infinite data — then the gap between them tells you when overfitting has begun to dominate. The optimal early stopping point Sstop satisfies an inequality derived from the learning curve equation, allowing you to estimate in advance how many steps a training run needs.
There is also a notion of critical batch size $B_{crit}$, which follows its own power law:

$$B_{crit}(L) = \frac{B_*}{L^{1/\alpha_B}}$$

with $B_* \approx 2 \times 10^{8}$ tokens and $\alpha_B \approx 0.21$. Below $B_{crit}$, increasing batch size is nearly free (same total compute, fewer steps). Above it, you get diminishing returns. The critical batch size depends only on the current loss, not on model size, which is another sign of universality.
The critical batch size has a physical interpretation from McCandlish et al. (2018). It represents the point where the gradient noise scale — the ratio of gradient variance to squared gradient norm — equals the batch size. Below this threshold, each additional sample in the batch provides genuinely new gradient information. Above it, you're mostly averaging out noise, with diminishing returns per additional sample. For the largest models Kaplan et al. trained, Bcrit was about 1-2 million tokens — big, but far smaller than modern training batch sizes.
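The critical batch size and the compute adjustment it implies can be sketched together. This assumes the form B_crit(L) = B_* / L^(1/α_B) with B_* ≈ 2 × 10^8 tokens and α_B ≈ 0.21, plus the correction C_min = C / (1 + B/B_crit) mentioned earlier in this article. The loss values passed in are illustrative.

```python
B_STAR, ALPHA_B = 2e8, 0.21  # tokens, exponent


def critical_batch_size(loss_nats: float) -> float:
    """B_crit(L) = B_* / L ** (1 / alpha_B), in tokens."""
    return B_STAR / loss_nats ** (1.0 / ALPHA_B)


def min_compute(compute: float, batch_tokens: float, b_crit: float) -> float:
    """Compute that would have sufficed at a batch size far below B_crit."""
    return compute / (1.0 + batch_tokens / b_crit)


if __name__ == "__main__":
    for loss_nats in (4.0, 3.0, 2.5):
        print(f"L = {loss_nats}: B_crit = {critical_batch_size(loss_nats):.2e} tokens")
    b_crit = critical_batch_size(3.0)
    wasted = 1.0 - min_compute(1.0, b_crit, b_crit)
    print(f"training exactly at B_crit spends {wasted:.0%} more compute than the minimum")
```

The critical batch size grows as the loss falls, so a schedule that is efficient early in training becomes conservative later. Training exactly at B_crit uses twice the minimum compute in exchange for needing fewer serial steps.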
This is the chapter that launched a thousand GPU clusters. Given a fixed compute budget C, how should you split it between model size, batch size, and number of training steps?
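The answer comes from minimizing $L(N, S_{\min})$ subject to the constraint $C_{\min} = 6N \cdot B \cdot S$. The paper finds:

$$N \propto C_{\min}^{0.73}, \qquad B \propto C_{\min}^{0.24}, \qquad S \propto C_{\min}^{0.03}$$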
This means optimal training stops far short of convergence. You build the biggest model your compute budget allows, train it on a modest dataset for a modest number of steps, and stop. This is enormously more sample-efficient than the conventional approach of training smaller models to convergence.
Think about what this means for a lab planning a large training run. The conventional approach would be: pick a model size, gather as much data as possible, and train until the loss plateaus. The scaling law approach is radically different: given your compute budget, the optimal strategy is to pick a much larger model than you think you need, and train it for fewer steps than you think are necessary. You are deliberately under-training an over-sized model — and this gives you the best loss per FLOP.
The paper quantifies the "zone of near-optimality" around the ideal N. Models between 0.6x and 2.2x the optimal size can be trained with at most a 20% compute overhead. So you do not need to hit the exact optimal model size — there is a comfortable range. But being 10x off (too small or too large) is genuinely wasteful.
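To get a feel for the allocation, one can ask how each quantity grows when the compute budget is multiplied by some factor. The sketch below uses only the empirical exponents reported by the paper (0.73 for model size, 0.24 for batch size, 0.03 for training steps) and works in relative terms, so no prefactors are assumed.

```python
# Empirical exponents for compute-efficient training, as reported by the paper.
EXPONENTS = {"N": 0.73, "B": 0.24, "S": 0.03}


def growth_factors(compute_multiplier: float) -> dict:
    """How N, B, S and D = B * S grow when compute grows by compute_multiplier."""
    growth = {name: compute_multiplier ** p for name, p in EXPONENTS.items()}
    growth["D"] = growth["B"] * growth["S"]  # tokens processed = batch size * steps
    return growth


if __name__ == "__main__":
    for multiplier in (10.0, 1e3, 1e6):
        g = growth_factors(multiplier)
        print(
            f"{multiplier:.0e}x compute -> N x{g['N']:.1f}, B x{g['B']:.1f}, "
            f"S x{g['S']:.2f}, D x{g['D']:.1f}"
        )
```

The three exponents sum to 1, so the factors multiply back to the compute multiplier, consistent with C = 6NBS. Nearly all of the extra compute goes into model size, while the number of serial steps barely moves.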
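The theoretical prediction matches empirical observation almost perfectly. The paper derives:

$$N \propto C_{\min}^{\alpha_C^{\min}/\alpha_N}, \qquad B \propto C_{\min}^{\alpha_C^{\min}/\alpha_B}, \qquad S \propto C_{\min}^{\alpha_C^{\min}/\alpha_S}, \qquad \alpha_C^{\min} = \frac{1}{1/\alpha_S + 1/\alpha_B + 1/\alpha_N}$$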
and the predicted exponents for N, B, and S as functions of Cmin match the empirical fits to within a few percent. The scaling framework is internally consistent — the independently measured power laws all connect together correctly.
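Let's work through a concrete example. Suppose you have a budget of $10^{3}$ PF-days (roughly 3,300 A100-days at the peak throughput quoted earlier, and more in practice since real utilization falls below peak). The scaling laws predict: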
There is an important subtlety. The paper estimates that compute-efficient training uses data $D \propto C^{0.27}$, growing slowly with compute. But to avoid overfitting, you need $D \propto N^{0.74} \propto C^{0.54}$, which grows much faster. These two constraints eventually contradict each other at around $C^* \sim 10^{4}$ PF-days, $N^* \sim 10^{12}$ parameters, $D^* \sim 10^{12}$ tokens, and $L^* \sim 1.7$ nats/token. The authors conjectured this intersection might represent a fundamental limit: the point at which you have extracted all learnable structure from natural language.
This conjecture is fascinating because it connects scaling laws to the entropy of natural language. If L* ≈ 1.7 nats/token, that would imply English text has an entropy of about 1.7 nats ≈ 2.4 bits per token. Shannon's original 1951 estimate was about 1-1.5 bits per character, which translates to roughly 4-6 bits per token (at ~4 characters per token) — in the same ballpark, but higher. The truth is uncertain because the extrapolation spans many orders of magnitude beyond observed data.
As of 2024, we have reached and surpassed $N^* = 10^{12}$ parameters (GPT-4 is rumored to be in this range). And the scaling laws have not broken down in the catastrophic way the paper feared. The resolution seems to be that new, higher-quality data sources (code, curated web, synthetic data) and better training recipes have shifted the constants, extending the power laws further than the 2020 extrapolation predicted. The framework was right; the specific numbers were bound to a particular data regime.
Now let's see the scaling laws in action. The interactive plot below shows three views: loss vs. parameters, loss vs. data, and loss vs. compute. Each view renders the power-law relationship as a straight line on log-log axes, with simulated empirical data points scattered along it.
The key interactive element is the compute budget slider. Drag it to change your total compute budget from $10^{-8}$ PF-days (a few seconds of training) to $10^{2}$ PF-days (months on a GPU cluster). Watch how the optimal model size, dataset size, and achievable loss change as you move the slider.
On the "Loss vs N" tab, notice how the optimal-N marker jumps rapidly rightward as you increase compute; this visualizes $N \propto C^{0.73}$. On the "Loss vs D" tab, the optimal-D marker moves much more slowly, following $D \propto C^{0.27}$. This asymmetry is the paper's central practical finding.
To connect the simulation to real systems, here are some landmarks on the compute axis:
The beauty of the scaling law framework is that even though this paper only measured up to about $10^{2}$ PF-days, the straight-line extrapolation to $10^{4}$ has proven remarkably accurate. The power laws predicted GPT-3's performance before it was trained. They are the closest thing to a crystal ball that deep learning has produced.
This paper's influence cannot be overstated. It established scaling as a first-class research direction in machine learning — not just "let's make things bigger and hope," but "here are the exact equations that predict how much better things get as you scale." Before Kaplan et al., researchers focused on architectural innovations — better attention mechanisms, better normalization, clever training tricks. After this paper, the field realized that simply making things bigger, if done correctly, was often more effective than any architectural tweak.
The paper's methodology also set a standard. The careful controlled experiments — varying one factor at a time, holding others fixed, plotting on log-log axes, fitting power laws — became the template for all subsequent scaling studies. The field of "scaling science" or "scaling research" essentially began here.
GPT-3, announced just four months later (May 2020), was built directly on these scaling laws. Its 175 billion parameters and the decision to train on "only" 300 billion tokens (stopping well before convergence on that data) were informed by the predictions in this paper. Several of the paper's authors, including Tom Brown, Alec Radford, Rewon Child, Scott Gray, Jeff Wu, Benjamin Chess, and Dario Amodei, went on to co-author the GPT-3 paper. The scaling laws literally told them how big to build it.
GPT-3's remarkable few-shot capabilities validated the approach: scale up, and emergent abilities appear. Tasks that were impossible at 1B parameters became possible at 175B, not through any architectural innovation, but through pure scale. The scaling laws had predicted the loss correctly; what they hadn't predicted was that lower loss would unlock qualitatively new capabilities like in-context learning.
Chinchilla's key result: a 70B parameter model trained on 1.4 trillion tokens outperforms the 280B parameter Gopher model trained on 300 billion tokens, despite using the same compute budget. The smaller model was better because it was trained on more data. Under the Kaplan recipe, Gopher was over-sized and under-trained.
Why the discrepancy? Several factors:
But here is what matters most: the framework is the same. Both papers agree that scaling laws exist, that they are power laws, and that they can guide compute allocation. The disagreement is only about the precise exponents — how fast N should grow versus D — not the paradigm itself. Kaplan et al. built the framework. Chinchilla refined the numbers. Both papers are right that you should use power laws to plan your training runs. They just disagree on the optimal split.
The practical consequence of Chinchilla was immediate. LLaMA (Meta, 2023) followed the data-heavy recipe: its largest model has 65B parameters trained on 1.4 trillion tokens, close to Chinchilla's ratio of roughly 20 tokens per parameter, and its smaller models were trained well past that ratio. Compare GPT-3's 175B parameters trained on 300 billion tokens. LLaMA matched or exceeded GPT-3 performance at a fraction of the inference cost, because smaller models are cheaper to serve.
This highlights a distinction the original Kaplan paper did not emphasize: training compute vs. inference compute. The Kaplan recipe optimizes for training efficiency — best loss per training FLOP. But once trained, a large model costs more to run on every query. If you plan to serve millions of users, the inference cost dominates, and you want the smallest model that achieves your target loss. Chinchilla's "train more data, use fewer parameters" approach better serves this goal.
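The trade-off between training and inference cost can be made concrete with the same FLOP accounting used for C ≈ 6NBS: about 6N FLOPs per training token and about 2N FLOPs per generated token (the forward pass only). The sketch below compares the Chinchilla and Gopher configurations quoted above. The break-even helper is a hypothetical illustration that ignores memory, latency, and hardware utilization.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost: 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens


def inference_flops_per_token(n_params: float) -> float:
    """Approximate serving cost: one forward pass, about 2 FLOPs per parameter."""
    return 2.0 * n_params


def break_even_tokens(small: tuple, large: tuple) -> float:
    """Served tokens after which the smaller model's total FLOPs fall below the larger's."""
    extra_training = training_flops(*small) - training_flops(*large)
    saving_per_token = inference_flops_per_token(large[0]) - inference_flops_per_token(small[0])
    return extra_training / saving_per_token


chinchilla = (70e9, 1.4e12)   # (parameters, training tokens)
gopher = (280e9, 300e9)

if __name__ == "__main__":
    for name, cfg in (("Chinchilla", chinchilla), ("Gopher", gopher)):
        print(
            f"{name}: training {training_flops(*cfg):.2e} FLOPs, "
            f"inference {inference_flops_per_token(cfg[0]):.1e} FLOPs/token"
        )
    print(f"break-even after {break_even_tokens(chinchilla, gopher):.1e} served tokens")
```

Under this accounting the two training runs land within about 17% of each other, while Chinchilla is four times cheaper per generated token. Its slightly higher training bill is repaid after roughly 2 × 10^11 served tokens.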
Today, the scaling laws paradigm has expanded far beyond language. Similar power laws have been found for vision models (ViT), multimodal models (CLIP, Flamingo), code generation (Codex), protein folding (AlphaFold), and even scientific simulation. The principle that performance is a smooth, predictable function of scale has become one of the most reliable empirical facts in deep learning.
More recent work has extended scaling laws to include inference-time compute (how much thinking the model does at test time, as in chain-of-thought or tree search), post-training (RLHF, instruction tuning, where performance also scales predictably with reward model size and preference data), and mixture-of-experts architectures (where active parameters and total parameters follow different scaling curves). The framework keeps generalizing. Kaplan et al. gave us the grammar; the field is still writing sentences with it.
Perhaps the deepest legacy is cultural. Before this paper, the AI research community valued clever ideas above all else — a new attention variant, a better loss function, a smarter data augmentation trick. After this paper, the community also valued scale as a legitimate research strategy.
This shift led directly to the current era of frontier models, where the binding constraint is not ideas but compute, data, and engineering. Whether this is a good thing is debated — it has certainly raised the barrier to entry for academic researchers who lack access to massive GPU clusters. But the empirical reality is undeniable: the scaling laws work. Models built according to their predictions consistently outperform models designed with clever tricks but insufficient scale.
One final thought. The scaling laws tell us that loss improves predictably with scale. But they do not tell us what capabilities emerge at each loss level. The jump from "can complete sentences" to "can write essays" to "can reason about code" happens at specific loss thresholds that the scaling laws do not predict. Understanding the relationship between loss and capabilities remains one of the most important open problems in AI research — a frontier that Kaplan et al.'s framework points toward but does not cross.
This paper sits at a critical junction in the history of AI scaling. It took an empirical observation — bigger models work better — and turned it into a quantitative science. Let's map its connections to the broader landscape.
Before this paper, "scaling up" was a heuristic, a belief, a bet. After it, scaling became a research program with quantitative predictions. You could write down an equation, plug in your compute budget, and know what loss to expect. This transformed AI from a field driven by architectural innovation to one driven (at least partially) by resource allocation.
The paper is also philosophically significant. It suggests that the details of neural network architecture matter far less than the brute fact of scale. Width, depth, attention heads — these are secondary. What matters is how many parameters, how much data, how much compute. This echoes Rich Sutton's "Bitter Lesson" (2019): in the long run, methods that leverage computation scale better than methods that leverage human knowledge about the domain.
It is worth noting that several authors of this paper went on to co-found Anthropic (Dario Amodei, Sam McCandlish, Jared Kaplan, Tom Brown, among others). The scaling laws framework was not just an academic exercise — it was foundational to the strategy of building increasingly capable AI systems. The paper's predictions helped guide billions of dollars of compute investment across the industry.
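A summary of the paper's key equations, for reference:

$$
\begin{aligned}
L(N) &= (N_c/N)^{\alpha_N}, &\quad \alpha_N &\approx 0.076,\ N_c \approx 8.8 \times 10^{13} \\
L(D) &= (D_c/D)^{\alpha_D}, &\quad \alpha_D &\approx 0.095,\ D_c \approx 5.4 \times 10^{13}\ \text{tokens} \\
L(C_{\min}) &= (C_c^{\min}/C_{\min})^{\alpha_C^{\min}}, &\quad \alpha_C^{\min} &\approx 0.050,\ C_c^{\min} \approx 3.1 \times 10^{8}\ \text{PF-days} \\
L(N, D) &= \left[(N_c/N)^{\alpha_N/\alpha_D} + D_c/D\right]^{\alpha_D} \\
L(N, S_{\min}) &= (N_c/N)^{\alpha_N} + (S_c/S_{\min})^{\alpha_S}, &\quad \alpha_S &\approx 0.76,\ S_c \approx 2.1 \times 10^{3} \\
B_{crit}(L) &= B_*/L^{1/\alpha_B}, &\quad \alpha_B &\approx 0.21,\ B_* \approx 2 \times 10^{8}\ \text{tokens} \\
N &\propto C_{\min}^{0.73}, &\quad D &\propto C_{\min}^{0.27}
\end{aligned}
$$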
Paper details. "Scaling Laws for Neural Language Models," Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei. OpenAI / Johns Hopkins University, January 2020. arXiv:2001.08361.