Scaling Laws for Neural Language Models

Chapter 0: The Bet

It is January 2020. GPT-2 has just shown that language models can write coherent paragraphs. Labs everywhere are asking the same question: should we build a bigger model?

Bigger means more parameters, more data, more compute — more money. A 1.5-billion parameter model already costs serious GPU time. Going to 10 billion or 100 billion is a bet worth millions of dollars. Nobody wants to make that bet blind.

The stakes were real. Training GPT-2 (1.5 billion parameters) required weeks on high-end hardware. Going 10x bigger meant 10x the cost — and you might get nothing useful if the architecture hit a ceiling or the data ran out. The field had no theory for whether bigger models would keep getting better, or when they would plateau.

Kaplan, McCandlish, and their colleagues at OpenAI decided to answer the question empirically. They trained hundreds of Transformer language models, from tiny 768-parameter networks to 1.5-billion-parameter giants, and carefully measured how test loss (cross-entropy, in nats per token) changed as they varied three things:

N (parameters): the number of non-embedding parameters in the model

D (dataset size): the number of tokens used for training

C (compute): total floating-point operations, measured in PF-days

What they found stunned the field. The relationship between loss and each of these three variables is not chaotic, not complicated, not dependent on clever tricks. It is a smooth power law — a straight line on a log-log plot — spanning more than seven orders of magnitude.

The headline result: Double your compute budget, and the loss drops by a predictable, fixed percentage. This regularity holds across a factor of 10,000,000 in scale. Deep learning is not alchemy — it is engineering with predictable returns.

This paper did not just describe a curious empirical regularity. It gave the field a planning tool. For the first time, you could predict how well a model would perform before spending a cent on training. You could ask: "Given my GPU budget, how big should my model be? How much data do I need?" And get a quantitative answer.

The answers were surprising. The paper argued that you should train very large models on relatively modest amounts of data, and stop well before convergence. This was the opposite of conventional wisdom, which said you should train smaller models for longer. Two years later, DeepMind's Chinchilla paper would partially revise this conclusion — but the core framework of predictable power-law scaling remains the foundation of how we think about scale today.

A note on cross-entropy loss, since we'll use it throughout. Cross-entropy measures how surprised your model is by the next token. A loss of 3.5 nats/token means the model assigns, as a geometric mean over tokens, e^-3.5 ≈ 3% probability to the correct next token. A loss of 2.5 nats/token means e^-2.5 ≈ 8%, which is much better. Lower loss means the model is less surprised, which means it predicts language better.

The unit is nats (natural units of information), not bits. One nat = 1/ln(2) ≈ 1.44 bits. The paper uses nats throughout because they arise naturally from the natural logarithm in cross-entropy.
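The conversions above are easy to check numerically. The following is a minimal sketch; the function names are illustrative and not taken from the paper.

```python
import math


def nats_to_bits(loss_nats: float) -> float:
    """Convert a cross-entropy loss from nats to bits (1 nat = 1/ln 2 bits)."""
    return loss_nats / math.log(2)


def typical_token_probability(loss_nats: float) -> float:
    """Geometric-mean probability assigned to the correct next token."""
    return math.exp(-loss_nats)


for loss in (3.5, 2.5):
    print(
        f"{loss} nats/token = {nats_to_bits(loss):.2f} bits/token, "
        f"typical p(correct) = {typical_token_probability(loss):.3f}"
    )
```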

Let's understand exactly how these scaling laws work, what they mean, and why they changed everything.

What is the key empirical finding of this paper?

(a) Larger models are always better regardless of data or compute
(b) Test loss follows smooth power laws with model size, data size, and compute
(c) Transformer architecture details strongly determine performance

Chapter 1: What Is a Power Law?

Before we can understand the scaling laws, we need to understand the mathematical object at their heart: the power law.

A power law is a relationship of the form:

y = a · x^α

where α (alpha) is a constant exponent. If α is negative, then y decreases as x grows — exactly what we want for loss decreasing as we add more parameters.

The magical property of a power law is what happens when you take logarithms of both sides:

log(y) = log(a) + α · log(x)

This is the equation of a straight line, with slope α. So if you plot y versus x on log-log axes (both axes use logarithmic scale), a power law becomes a perfectly straight line. The slope of that line is the exponent.

Why this matters: If your data falls on a straight line in log-log space, you have a power law. And a straight line is the easiest thing in the world to extrapolate. You can predict performance at scales you have never tested — a 100-billion parameter model — by extending a line fit to models a thousand times smaller.
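Fitting a power law amounts to fitting a straight line to (log x, log y). The sketch below does this with ordinary least squares in pure Python. The data points are synthetic and noise-free, generated from an assumed law y = 10 · x^-0.076 purely for illustration; they are not measurements from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**alpha by least squares on (log x, log y).

    Returns (a, alpha). The slope of the log-log line is the exponent.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    alpha = sxy / sxx
    a = math.exp(mean_y - alpha * mean_x)
    return a, alpha


# Synthetic "small model" results (illustrative values only).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * n ** -0.076 for n in sizes]

a, alpha = fit_power_law(sizes, losses)

# Extrapolate the fitted line to a model 100x larger than any in the fit.
predicted_loss_100b = a * 1e11 ** alpha
print(f"alpha = {alpha:.3f}, extrapolated loss = {predicted_loss_100b:.3f}")
```

With real, noisy measurements the fitted slope carries uncertainty, and that uncertainty grows the further the line is extended.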

Power laws are everywhere in nature. City populations, earthquake magnitudes, word frequencies (Zipf's law), and now neural network performance all follow them. They typically arise when the underlying system has self-similarity across scales — the structure at one scale looks like a rescaled version of the structure at another.

Contrast this with exponential relationships like y = a · e^(bx). On a log-linear plot (log y vs. x), exponentials are straight lines. On a log-log plot, they curve. Power laws and exponentials look completely different, and they behave differently at scale. Exponentials either explode or decay to zero rapidly. Power laws are gentler: they always give diminishing returns, never hitting a wall, never leveling off to a hard plateau. This is exactly what Kaplan et al. observed for neural network loss.

Why might neural scaling follow power laws rather than exponentials? The paper doesn't fully answer this, but one intuition comes from information theory. Natural language has structure at many scales — character patterns, word collocations, syntactic rules, semantic relationships, discourse structure, world knowledge. A model with N parameters can capture structure up to some complexity threshold. Each order of magnitude in N gives access to a new "layer" of linguistic structure, yielding a roughly constant fractional improvement in loss. This multiplicative structure at each scale is the hallmark of a power law.

For this paper, the specific form is:

L(N) = (N_c / N)^α_N

where L is the test loss (cross-entropy in nats), N is the number of non-embedding parameters, N_c is a constant that sets the scale, and α_N ≈ 0.076 is the power-law exponent. That exponent tells you: every time you double N, the loss shrinks by a factor of 2^-0.076 ≈ 0.949, about a 5% reduction.

Five percent might sound small. But it compounds. Going from 1 million to 1 billion parameters (a factor of 1000) reduces loss by 1000^0.076 ≈ 1.7x. And because cross-entropy loss operates in log-probability space, even small reductions can mean dramatically better text generation.

A worked example. Suppose a 10M-parameter model achieves a loss of 3.4 nats/token. What does the scaling law predict for a 1B-parameter model (100x larger)?

L(1B) / L(10M) = (10M / 1B)^0.076 = (0.01)^0.076 = 10^{-2 × 0.076} = 10^-0.152 ≈ 0.705

So L(1B) ≈ 3.4 × 0.705 ≈ 2.4 nats/token. The starting loss here is hypothetical, but the losses Kaplan et al. measured follow this kind of prediction closely. The prediction works because the underlying relationship really is a power law: the data points fall on that straight line with remarkable precision.

One PF-day, by the way, is 10¹⁵ × 86,400 = 8.64 × 10¹⁹ floating-point operations. A modern GPU (A100) does about 3 × 10¹⁴ FLOPS, so one PF-day is roughly 3 days on a single A100. The paper spans from about 10^-8 PF-days (seconds of training) to 10² PF-days (months on a GPU cluster).

On a log-log plot, what shape does a power-law relationship produce?

(a) A straight line
(b) An exponential curve
(c) A plateau that flattens out

Chapter 2: Three Knobs, Three Laws

The paper identifies three independent scaling laws, one for each "knob" you can turn. Each is a power law. Each produces a straight line on a log-log plot. And each holds when the other two knobs are turned up high enough not to be the bottleneck.

But first, a note on the experimental setup. All models are decoder-only Transformers trained on WebText2 — an extended version of the dataset used for GPT-2, containing about 23 billion tokens scraped from Reddit outbound links with at least 3 karma (a heuristic for "interesting content"). They used the Adam optimizer with a cosine learning rate schedule, 1024-token contexts, and byte-pair encoding with a vocabulary of 50,257 tokens. The crucial decision to count only non-embedding parameters — excluding the token embedding and positional embedding matrices — produced much cleaner scaling laws.

Compute is estimated as C ≈ 6NBS floating-point operations, where B is batch size and S is the number of training steps. The factor 6 accounts for the forward pass (2N FLOPs per token) and backward pass (roughly 4N FLOPs per token). One PF-day = 8.64 × 10¹⁹ FLOPs.
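The compute estimate can be turned into a small calculator. This is a sketch under stated assumptions: the example run (1.5B parameters, batches of 512 sequences of 1024 tokens, 250,000 steps) and the A100 peak throughput of 3 × 10¹⁴ FLOP/s are illustrative inputs, not figures this chapter attributes to any particular experiment.

```python
PF_DAY_FLOPS = 1e15 * 86_400  # one petaflop/s sustained for one day = 8.64e19 FLOPs


def training_compute_pf_days(n_params: float, batch_tokens: float, steps: float) -> float:
    """Estimate training compute as C = 6 * N * B * S, expressed in PF-days."""
    flops = 6 * n_params * batch_tokens * steps
    return flops / PF_DAY_FLOPS


def gpu_days_at_peak(pf_days: float, gpu_flops_per_second: float = 3e14) -> float:
    """Days on one GPU at an assumed peak throughput (real utilization is lower)."""
    return pf_days * PF_DAY_FLOPS / gpu_flops_per_second / 86_400


# Hypothetical run: 1.5B non-embedding parameters, 512 x 1024 tokens per batch.
example_pf_days = training_compute_pf_days(1.5e9, 512 * 1024, 2.5e5)
print(f"{example_pf_days:.1f} PF-days, about {gpu_days_at_peak(example_pf_days):.0f} GPU-days at peak")
```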

Law 1: Loss vs. Parameters (N)

L(N) = (N_c / N)^α_N

With α_N ≈ 0.076 and N_c ≈ 8.8 × 10¹³. This holds when you train to convergence on a sufficiently large dataset. It says: make the model bigger, loss goes down, predictably. The exponent 0.076 means you get diminishing returns: you need roughly 10x more parameters for each 16% reduction in loss.

Law 2: Loss vs. Data (D)

L(D) = (D_c / D)^α_D

With α_D ≈ 0.095 and D_c ≈ 5.4 × 10¹³ tokens. This holds for a sufficiently large model trained with early stopping. More data helps, and the returns are slightly better than for parameters (α_D > α_N).

To isolate this law, the paper trained a single large model (n_layer=36, d_model=1280) on fixed subsets of WebText2, ranging from 22 million to 22 billion tokens. They stopped training once the test loss ceased to decrease. The resulting losses fell on a clean power law in D. The constant D_c ≈ 5.4 × 10¹³ is enormously large: roughly 54 trillion tokens. It is the dataset size at which the data scaling law would predict a loss of 1 nat/token (a pure power law never predicts zero loss). Every dataset we currently use is "small" by the standards of this power law.

Law 3: Loss vs. Compute (C_min)

L(C_min) = (C_c^min / C_min)^{α_C^min}

With α_C^min ≈ 0.050 and C_c^min ≈ 3.1 × 10⁸ PF-days. This is the most important law for practitioners, because compute is usually the binding constraint. It says: given a fixed compute budget, with optimal allocation between model size and training duration, loss scales as a power law in total compute.

The subscript "min" on C_min is important. It refers to the minimum compute needed to reach a given loss — what you'd use if training at the optimal (usually small) batch size. In practice, researchers often train at larger batch sizes for speed, using more total compute than C_min. The paper carefully adjusts for this by defining C_min(C) = C / (1 + B/B_crit), which corrects for the inefficiency of training at batch sizes above the critical threshold.
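The batch-size adjustment is a one-line formula. The sketch below implements it as stated; the function name and the sample numbers are illustrative.

```python
def adjusted_min_compute(compute: float, batch_size: float, b_crit: float) -> float:
    """Return C_min = C / (1 + B / B_crit).

    compute may be in any unit (for example PF-days); batch_size and b_crit
    must share a unit (for example tokens).
    """
    if b_crit <= 0:
        raise ValueError("b_crit must be positive")
    return compute / (1.0 + batch_size / b_crit)


# Training exactly at the critical batch size uses twice the minimum compute.
print(adjusted_min_compute(10.0, batch_size=1e6, b_crit=1e6))
# A batch far below B_crit wastes almost nothing.
print(adjusted_min_compute(10.0, batch_size=1e4, b_crit=1e6))
```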

The critical insight: These three laws are not independent. They are three projections of a single underlying relationship. The loss depends on all three simultaneously, and there is a single equation that unifies them — we will see it in Chapter 4.

Notice the exponents are all small: 0.076, 0.095, 0.050. This means the returns to scale are always diminishing. You never get linear returns. But you always get returns. There is no wall, no plateau, no point of "good enough." As long as you can afford to scale, performance improves — smoothly and predictably.

Let's build intuition for what these exponents feel like. With α_N = 0.076:

2x parameters: Loss × 0.949 (5.1% reduction)

10x parameters: Loss × 0.839 (16.1% reduction)

1000x parameters: Loss × 0.592 (40.8% reduction)

Each factor of 10 in parameters buys you the same percentage reduction. This is the essence of a power law: constant returns on a logarithmic scale. Going from 1M to 10M parameters gives the same percentage improvement as going from 1B to 10B. The absolute improvement in loss is larger for the 1M-to-10M jump, because it starts from a higher loss, but the relative improvement is identical.
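The multipliers for any scale factor and exponent follow from a single expression. A minimal sketch, with an illustrative function name:

```python
def loss_multiplier(scale_factor: float, alpha: float) -> float:
    """Factor by which a power-law loss changes when its input grows by scale_factor."""
    return scale_factor ** (-alpha)


ALPHA_N = 0.076  # parameter exponent quoted in the text

for factor in (2, 10, 1000):
    m = loss_multiplier(factor, ALPHA_N)
    print(f"{factor}x parameters: loss x {m:.3f} ({(1 - m) * 100:.1f}% reduction)")
```

Swapping in 0.095 or 0.050 for the exponent gives the corresponding figures for data and compute.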

The compute exponent α_C = 0.050 is the smallest, which makes sense. Compute is "split" between parameters and training time, so its effective power is diluted. The paper derives a theoretical relationship: α_C^min = 1/(1/α_S + 1/α_B + 1/α_N), which predicts 0.054 — matching the empirical 0.050 within measurement error.

The paper verified these trends across six orders of magnitude in N (768 to 1.5 billion parameters), three orders of magnitude in D (22 million to 22 billion tokens), and eight orders of magnitude in compute. At no point did the straight lines on the log-log plots show any sign of bending.

Which scaling exponent is largest, giving the best returns per multiplicative increase?

(a) α_N ≈ 0.076 (parameters)
(b) α_D ≈ 0.095 (data)
(c) α_C ≈ 0.050 (compute)

Chapter 3: Shape Doesn't Matter

Here is one of the paper's most surprising findings. You might expect that the architecture of a Transformer — how deep it is, how wide, how many attention heads — would strongly affect performance. After all, researchers spend enormous effort tuning these hyperparameters.

Kaplan et al. found the opposite. When they held the total number of non-embedding parameters N fixed and varied the shape — depth versus width, feed-forward ratio, number of attention heads — the loss changed by only a few percent.

The revelation: A model with shape (n_layer=6, d_model=4288) and a model with shape (n_layer=48, d_model=1600) have roughly the same number of parameters. And they achieve nearly the same loss. Across the shapes the paper tested, the aspect ratio varies by a factor of 40, but performance barely budges.

This is enormously consequential. It means that when budgeting for performance, you should think in terms of total parameter count, not architectural details. The knob that matters is "how big," not "what shape."

The paper tested this systematically. They varied the feed-forward ratio (d_ff / d_model, normally 4), the aspect ratio (d_model / n_layer), and the attention head dimension (d_model / n_heads), each time adjusting other dimensions to keep N constant at either 25M or 50M parameters. The loss varied by at most a few percent across the entire range.

There are some caveats. Models with fewer than 2 layers perform noticeably worse, and extreme depth-to-width ratios (very narrow, very deep) degrade slightly. But within a wide "reasonable" range, shape is approximately irrelevant.

Why might this be? One speculation, noted in the paper, is that deeper Transformers may behave as ensembles of shallower models, similar to what has been observed for ResNets. If deeper models are roughly equivalent to ensembles of shallower models, then swapping depth for width (at constant total parameters) doesn't fundamentally change what the model can represent — it just reorganizes the same capacity.

One important detail: this result only holds when you count non-embedding parameters. The embedding matrix (which maps tokens to vectors) and positional embeddings are excluded. When embedding parameters are included, performance appears to depend on depth — but this is an artifact. The embedding matrix has n_vocab × d_model parameters, which scales linearly with d_model. For a shallow, wide model (large d_model), the embedding matrix dominates the total parameter count, inflating it without contributing proportionally to modeling capacity. Excluding it reveals the clean scaling law.

This is a methodological lesson that reverberates through all subsequent scaling work: choose your metrics carefully. Counting the "wrong" parameters (including embeddings) would have obscured the very regularity this paper discovered. The authors tried both and noticed the cleaner fit. Sometimes the most important scientific decision is what to put on your axes.

N ≈ 12 · n_layer · d_model²

This formula counts the non-embedding parameters in a standard Transformer, where d_attn = d_ff/4 = d_model. The factor 12 comes from the attention (Q, K, V, output projection) and feed-forward (two linear layers) components. Each layer contributes 12d_model² parameters.
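As a quick check of the formula, the sketch below counts non-embedding parameters for the two shapes compared earlier in this chapter. It assumes the standard configuration stated above (d_attn = d_model, d_ff = 4 · d_model) and ignores biases and layer-norm parameters.

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    """Approximate non-embedding parameter count: N = 12 * n_layer * d_model**2.

    Per layer: 4 * d_model**2 for attention (Q, K, V, output projection)
    plus 8 * d_model**2 for the two feed-forward matrices.
    """
    return 12 * n_layer * d_model ** 2


for n_layer, d_model in [(6, 4288), (48, 1600)]:
    n = non_embedding_params(n_layer, d_model)
    print(f"n_layer={n_layer}, d_model={d_model}: N = {n / 1e9:.2f}B")
```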

The practical upshot: if you are designing a model and have a target parameter count, you have wide latitude in choosing the aspect ratio. Pick whatever is most convenient for your hardware (wide models parallelize better across GPUs; deep models use less memory per layer).

The paper also compared Transformers to LSTMs. Both architectures follow power laws, but with a critical difference: Transformers outperform LSTMs on late tokens in the context. An LSTM's performance plateaus after about 100 tokens of context — it cannot effectively use information from early in a long sequence. Transformers improve continuously through the full 1024-token context. This is a consequence of the attention mechanism providing O(1) access to all positions, versus the LSTM's sequential information bottleneck.

When comparing two Transformers with the same number of non-embedding parameters but very different depth-to-width ratios, what happens to performance?

(a) It stays nearly the same (within a few percent)
(b) The deeper model always wins significantly
(c) The wider model always wins significantly

Chapter 4: Overfitting Is Predictable

When do you start wasting parameters? When your model is too big for your data, it memorizes instead of generalizing. This is overfitting, and it is the bane of machine learning. Every practitioner has experienced it: training loss keeps dropping, but test loss starts climbing. The model is fitting noise in the training data rather than learning generalizable patterns.

The conventional wisdom is that overfitting is unpredictable — you just have to watch the validation curve and stop when it turns. But Kaplan et al. showed that overfitting, too, follows a predictable law. You can calculate in advance exactly how much data you need for a given model size to keep overfitting below any threshold.

The unified equation for loss as a function of both model size N and dataset size D is:

L(N, D) = [ (N_c/N)^(α_N/α_D) + D_c/D ]^α_D

This single equation elegantly encodes both individual scaling laws. Set D to infinity: the D_c/D term vanishes, and you recover L(N) = (N_c/N)^α_N. Set N to infinity: the first term vanishes, and you recover L(D) = (D_c/D)^α_D. In between, the two terms compete — whichever is larger (model too small, or data too small) dominates the loss.
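The two limits can be verified numerically. This sketch plugs in the single-variable constants quoted in Chapter 2 as an assumption; a joint fit of the combined equation may yield somewhat different values.

```python
import math

# Constants as quoted for the single-variable laws (assumed here for illustration).
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13


def loss_nd(n_params: float, n_tokens: float) -> float:
    """Combined law: L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


# Infinite data recovers L(N); an infinitely large model recovers L(D).
print(loss_nd(1e9, math.inf), (N_C / 1e9) ** ALPHA_N)
print(loss_nd(math.inf, 1e10), (D_C / 1e10) ** ALPHA_D)
# With both finite, the loss is higher than either limit alone.
print(loss_nd(1e9, 1e10))
```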

The paper derived this functional form from three principles. First, changes in tokenization should rescale the loss by a constant factor, which this form allows. Second, fixing either N or D and sending the other to infinity should recover the single-variable power laws. Third, the loss should be analytic at D = ∞, meaning it has a series expansion in powers of 1/D. This third principle is the most speculative, but it explains the asymmetric roles of N and D in the equation.

The degree of overfitting — defined as δL = L(N,D)/L(N,∞) - 1, the fractional excess loss compared to training with infinite data — depends on a specific ratio:

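δL depends on N^(α_N/α_D) / D ≈ N^0.74 / D

When this ratio is small, overfitting is negligible. When it grows, the model has more capacity than the data can fill, and it starts memorizing. The exponent α_N/α_D arises directly from the combined L(N,D) equation. Its value of 0.74 comes from the paper's joint fit of that equation, which gives α_D ≈ 0.103; the single-variable value α_D ≈ 0.095 would give 0.80 instead. It has a beautifully practical meaning: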

The 8x/5x rule: Every time you increase the model size by 8x, you only need to increase the dataset by roughly 5x to keep overfitting at the same level. Data requirements grow sub-linearly with model size. Larger models are more sample-efficient.

This sub-linear scaling was surprising and has important economic implications. Naive intuition says "bigger model, more parameters to fit, need proportionally more data." But larger Transformers extract more information per data point. They learn faster: in the paper's experiments, a larger model reaches a given loss after processing fewer tokens than a smaller model does.

Think about it from an information-theoretic perspective. A larger model has more "buckets" to sort information into. When it sees a training example, it can extract finer-grained patterns from it. A small model might learn "this is an English sentence" from a data point. A large model might additionally learn the subtle syntactic structure, the semantic relationships between words, and the pragmatic context — all from the same example. More parameters means more information per data point means sub-linear data requirements.

The paper uses a 10% dropout rate for all models in these experiments. They acknowledge that optimizing regularization (varying dropout rate with model and dataset size) might change the quantitative results, but expect the qualitative picture — sub-linear data scaling — to hold. This is an important caveat: the specific exponent 0.74 is measured under one particular training recipe, and might shift slightly under different regularization choices.

Let's make this concrete. With equation (4.4) from the paper, to avoid overfitting you need:

D ≥ 5,000 · N^0.74

For a 1B parameter model: D ≥ 5000 × (10⁹)^0.74 ≈ 5000 × 4.6 × 10⁶ ≈ 2.3 × 10¹⁰ tokens (about 23 billion tokens). The WebText2 training set had 22 billion tokens, so a 1B-parameter model sits right at the edge of this threshold, while models well below 1B parameters train comfortably without overfitting. A 10B-parameter model would need D ≥ 5000 × (10¹⁰)^0.74 ≈ 1.3 × 10¹¹ tokens (about 130 billion), far more than the training set contains.
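The threshold D ≥ 5,000 · N^0.74 can be evaluated directly for any model size. A minimal sketch, with an illustrative function name:

```python
def min_tokens_to_avoid_overfitting(n_params: float) -> float:
    """Approximate data requirement D >= 5e3 * N**0.74, in tokens."""
    return 5e3 * n_params ** 0.74


for n in (1e8, 1e9, 1e10):
    tokens = min_tokens_to_avoid_overfitting(n)
    print(f"N = {n:.0e}: D >= {tokens:.2e} tokens")

# Growing the model 8x raises the data requirement by 8**0.74, a little under 5x.
ratio_8x = min_tokens_to_avoid_overfitting(8e9) / min_tokens_to_avoid_overfitting(1e9)
print(f"8x model needs {ratio_8x:.2f}x the data")
```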

The paper fit this combined equation to experiments spanning from 22 million to 22 billion tokens of training data and from tiny to billion-parameter models. The fits were excellent, except for extremely small datasets (~20 million tokens), which may represent a qualitatively different regime where overfitting happens within the first few training steps.

An important detail: the paper also found that transfer loss — performance on text distributions different from the training set (books, Wikipedia, Common Crawl) — scales in parallel with the training distribution. There is a constant offset (the new distribution is harder or easier), but the slope is the same. This means scaling laws predict out-of-distribution performance too, not just performance on the training set. Generalization improves predictably with scale.

If you increase your model size by 8x, how much more data do you need to keep overfitting constant?

(a) 8x (linear scaling)
(b) ~5x (sub-linear scaling)
(c) 2x (logarithmic scaling)

Chapter 5: Training Curves Are Universal

Watch a neural network train. Loss starts high and drops rapidly at first, then slows. This learning curve has a characteristic shape. What Kaplan et al. discovered is that this shape is essentially the same for all model sizes.

After an initial transient period (the first few thousand steps, where the learning rate is warming up), all learning curves can be fit by:

L(N, S_min) = (N_c/N)^α_N + (S_c/S_min)^α_S

where S_min is the number of training steps (adjusted to the optimal batch size), S_c ≈ 2.1 × 10³, and α_S ≈ 0.76.

This equation has two additive terms, and the additive structure is key to understanding it.

The first term, (N_c/N)^α_N, is the irreducible loss for model size N — the best you could ever achieve by training this model forever on infinite data. It represents the model's fundamental capacity limit. A 10M-parameter model simply cannot represent certain patterns in language, no matter how long you train it.

The second term, (S_c/S_min)^α_S, is the training deficit — how far you are from that floor because you haven't trained long enough. Early in training, this term dominates. Late in training, it shrinks and the model's capacity limit takes over.

The additive form means the two sources of error are independent. You can be limited by model size, limited by training time, or (in the worst case) limited by both. The optimal allocation of compute tries to balance these two terms so neither dominates excessively.

Why this is powerful: By fitting the early portion of a training curve, you can predict the loss that would be achieved by training for 10x or 100x longer. This means you can make expensive training decisions based on cheap short experiments.

The universality runs deep. The training-time exponent α_S ≈ 0.76 is the same for all model sizes. The constant S_c is the same for all model sizes. Only the irreducible loss floor changes with N. This means all models follow parallel tracks on a log-loss vs. log-steps plot — larger models start lower and stay lower, but the rate of improvement per step is universal.
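The two-term structure is easy to explore in code. The sketch below reuses N_c and α_N from Law 1 together with the S_c and α_S quoted above; treating those as the constants for this equation is an assumption, since a dedicated fit could differ slightly.

```python
def learning_curve_loss(
    n_params: float,
    s_min: float,
    n_c: float = 8.8e13,
    alpha_n: float = 0.076,
    s_c: float = 2.1e3,
    alpha_s: float = 0.76,
) -> float:
    """L(N, S_min) = (N_c/N)^alpha_N + (S_c/S_min)^alpha_S."""
    capacity_floor = (n_c / n_params) ** alpha_n   # set by model size alone
    training_deficit = (s_c / s_min) ** alpha_s    # shrinks as training proceeds
    return capacity_floor + training_deficit


# Two model sizes at the same step counts: the gap between their curves
# is the difference in capacity floors, independent of the step count.
for steps in (1e4, 1e5, 1e6):
    small = learning_curve_loss(1e7, steps)
    large = learning_curve_loss(1e9, steps)
    print(f"S_min = {steps:.0e}: L(10M) = {small:.3f}, L(1B) = {large:.3f}")
```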

The paper speculates that this universality reflects something about the loss landscape of Transformers. Since the power-law fits are best late in training (when the loss surface may be approximately quadratic), the exponent α_S likely encodes information about the spectrum of the Hessian matrix — specifically, the density of eigenvalues. The fact that this spectrum appears to be roughly independent of model size is a deep and unexplained empirical regularity.

A practical consequence: if you are running a hyperparameter search, you can train many small models for short runs, measure their learning curves, and use the universal learning curve equation to predict which configuration would perform best at full scale. This is a massive cost savings — you replace one expensive run with many cheap ones.

What about early stopping? The paper shows you can predict when to stop training. If you know L(N, D) — the loss you'll achieve with model size N and dataset D — and L(N, ∞) — the loss with infinite data — then the gap between them tells you when overfitting has begun to dominate. The optimal early stopping point S_stop satisfies an inequality derived from the learning curve equation, allowing you to estimate in advance how many steps a training run needs.

There is also a notion of critical batch size B_crit, which follows its own power law:

B_crit(L) = B_* / L^(1/α_B)

with B_* ≈ 2 × 10⁸ tokens and α_B ≈ 0.21. Below B_crit, increasing batch size is nearly free (same total compute, fewer steps). Above it, you get diminishing returns. The critical batch size depends only on the current loss, not on model size — another sign of universality.
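Because B_crit depends only on the loss, it can be tabulated in a few lines. A minimal sketch using the constants quoted above; the sample loss values are illustrative.

```python
def critical_batch_size(loss_nats: float, b_star: float = 2e8, alpha_b: float = 0.21) -> float:
    """B_crit(L) = B_* / L^(1/alpha_B), in tokens."""
    return b_star / loss_nats ** (1.0 / alpha_b)


# The critical batch size grows quickly as the loss falls.
for loss in (4.0, 3.0, 2.5):
    print(f"L = {loss}: B_crit = {critical_batch_size(loss):.2e} tokens")
```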

The critical batch size has a physical interpretation from McCandlish et al. (2018). It represents the point where the gradient noise scale — the ratio of gradient variance to squared gradient norm — equals the batch size. Below this threshold, each additional sample in the batch provides genuinely new gradient information. Above it, you're mostly averaging out noise, with diminishing returns per additional sample. For the largest models Kaplan et al. trained, B_crit was about 1-2 million tokens — big, but far smaller than modern training batch sizes.

What does the universality of training curves allow you to do in practice?

(a) Predict final loss from early training data
(b) Skip training entirely and compute the loss analytically
(c) Train without any data at all

Chapter 6: Compute-Optimal Allocation

This is the chapter that launched a thousand GPU clusters. Given a fixed compute budget C, how should you split it between model size, batch size, and number of training steps?

The answer comes from minimizing L(N, S_min) subject to the constraint C_min = 6N · B · S. The paper finds:

N ∝ C^0.73: model size should grow rapidly with compute

B ∝ C^0.24: batch size grows modestly

S ∝ C^0.03: training steps barely grow at all

The shocking conclusion: When compute increases by 1000x, almost all of it should go to making the model bigger (N grows ~150x). Data grows modestly (~6.5x). The number of training steps barely changes (~1.2x). You should train enormous models for surprisingly few steps.
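The split implied by these exponents can be computed for any increase in compute. In this sketch the data multiplier is taken as batch size times steps, so D ∝ C^0.27; the names are illustrative.

```python
# Exponents quoted in the text for compute-optimal allocation.
EXPONENTS = {"N": 0.73, "B": 0.24, "S": 0.03}


def allocation_multipliers(compute_factor: float) -> dict:
    """How N, B, S, and data D = B * S scale when compute grows by compute_factor."""
    multipliers = {name: compute_factor ** exp for name, exp in EXPONENTS.items()}
    multipliers["D"] = multipliers["B"] * multipliers["S"]
    return multipliers


result = allocation_multipliers(1000)
for name, value in result.items():
    print(f"{name} grows {value:.2f}x")
```

The three exponents sum to 1, so the N, B, and S multipliers always multiply back to the full compute factor.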

This means optimal training stops far short of convergence. You build the biggest model your compute budget allows, train it on a modest dataset for a modest number of steps, and stop. This is enormously more sample-efficient than the conventional approach of training smaller models to convergence.

Think about what this means for a lab planning a large training run. The conventional approach would be: pick a model size, gather as much data as possible, and train until the loss plateaus. The scaling law approach is radically different: given your compute budget, the optimal strategy is to pick a much larger model than you think you need, and train it for fewer steps than you think are necessary. You are deliberately under-training an over-sized model — and this gives you the best loss per FLOP.

The paper quantifies the "zone of near-optimality" around the ideal N. Models between 0.6x and 2.2x the optimal size can be trained with at most a 20% compute overhead. So you do not need to hit the exact optimal model size — there is a comfortable range. But being 10x off (too small or too large) is genuinely wasteful.

The theoretical prediction closely matches empirical observation. The paper derives:

α_C^min = 1 / (1/α_S + 1/α_B + 1/α_N) ≈ 0.054

which is close to the empirically fitted α_C^min ≈ 0.050,

and the predicted exponents for N, B, and S as functions of C_min match the empirical fits to within a few percent. The scaling framework is internally consistent — the independently measured power laws all connect together correctly.

Let's work through a concrete example. Suppose you have a budget of 10³ PF-days (roughly 3,000 A100-days at peak throughput, and more in practice). The scaling laws predict:

N_opt ≈ 1.3 × 10⁹ × (10³)^0.73 ≈ 2.0 × 10¹¹: about 200 billion parameters. Build a massive model.

D ≈ C / (6N) ≈ 7.2 × 10¹⁰ tokens: about 72 billion tokens. Modest for a model this size.

L ≈ (3.1 × 10⁸ / 10³)^0.050 ≈ 1.88 nats: a loss of about 1.88 nats/token.
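The same three formulas can be wrapped into a small planning helper. This is a sketch under the assumptions used in the text: N_opt = 1.3 × 10⁹ · C^0.73, D = C / (6N) with C converted to FLOPs, and L = (3.1 × 10⁸ / C)^0.050, with C in PF-days. It ignores the batch-size correction that separates C from C_min.

```python
PF_DAY_FLOPS = 8.64e19  # FLOPs in one PF-day


def compute_optimal_plan(budget_pf_days: float) -> tuple:
    """Return (optimal parameters, training tokens, predicted loss in nats/token)."""
    n_opt = 1.3e9 * budget_pf_days ** 0.73
    tokens = budget_pf_days * PF_DAY_FLOPS / (6 * n_opt)
    loss = (3.1e8 / budget_pf_days) ** 0.050
    return n_opt, tokens, loss


for budget in (1.0, 1e3):
    n_opt, tokens, loss = compute_optimal_plan(budget)
    print(f"C = {budget:g} PF-days: N = {n_opt:.2e}, D = {tokens:.2e}, L = {loss:.2f}")
```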

There is an important subtlety. The paper estimates that compute-efficient training uses data D ∝ C^0.27, growing slowly with compute. But to avoid overfitting, you need D ∝ N^0.74 ∝ C^0.54, which grows much faster. These two constraints eventually contradict each other at around C^* ~ 10⁴ PF-days, N^* ~ 10¹² parameters, D^* ~ 10¹² tokens, and L^* ~ 1.7 nats/token. The authors conjectured this intersection might represent a fundamental limit — the point at which you have extracted all learnable structure from natural language.

This conjecture is fascinating because it connects scaling laws to the entropy of natural language. If L^* ≈ 1.7 nats/token, that would imply English text has an entropy of about 1.7 nats ≈ 2.5 bits per token. Shannon's 1951 experiments bounded the entropy of printed English between roughly 0.6 and 1.3 bits per character, which translates to roughly 2.4 to 5.2 bits per token (at ~4 characters per token), so the conjectured value sits at the low end of the same ballpark. The truth is uncertain because the extrapolation spans many orders of magnitude beyond observed data.

As of 2024, we have reached and surpassed N^* = 10¹² parameters (GPT-4 is rumored to be in this range). And the scaling laws have not broken down in the catastrophic way the paper feared. The resolution seems to be that new, higher-quality data sources (code, curated web, synthetic data) and better training recipes have shifted the constants, extending the power laws further than the 2020 extrapolation predicted. The framework was right; the specific numbers were bound to a particular data regime.

If your compute budget increases by 10x, what should you primarily do?

(a) Make the model ~5x bigger
(b) Train for 10x as long on the same model
(c) Use 10x more data on the same model

Chapter 7: Showcase — Scaling Law Explorer

Now let's see the scaling laws in action. The interactive plot below shows three views: loss vs. parameters, loss vs. data, and loss vs. compute. Each view renders the power-law relationship as a straight line on log-log axes, with simulated empirical data points scattered along it.

The key interactive element is the compute budget slider. Drag it to change your total compute budget from 10^-8 PF-days (a few seconds of training) to 10² PF-days (months on a GPU cluster). Watch how the optimal model size, dataset size, and achievable loss change as you move the slider.

On the "Loss vs N" tab, notice how the optimal-N marker jumps rapidly rightward as you increase compute — this visualizes N ∝ C^0.73. On the "Loss vs D" tab, the optimal-D marker moves much more slowly — D ∝ C^0.27. This asymmetry is the paper's central practical finding.

Use the tabs to switch between views. Adjust the compute budget to watch the optimal allocation change in real time.

[Interactive explorer: compute budget slider (log₁₀ PF-days, default -2.0) with live readouts for optimal N, optimal D, and loss]

What to notice: As you drag the compute slider right (more compute), the optimal model size on the "Loss vs N" tab jumps rightward rapidly, while the optimal data size on "Loss vs D" moves much more slowly. This visualizes the paper's key finding: spend extra compute on bigger models, not more data. Also notice that the "empirical" scatter points cluster tightly around the power-law line — this is the remarkable empirical regularity that makes the whole framework work.

Try this experiment: Set the compute slider to -6 (10^-6 PF-days, a short training run) and note the optimal N and loss. Then slide to 0 (1 PF-day, a serious training run). The optimal N grows by roughly 10^4.4 ≈ 25,000x, while the loss drops from ~3.5 to ~2.1. Now slide to +2 (100 PF-days). N grows by another 10^1.46 ≈ 29x, but the loss only drops from ~2.1 to ~1.8. Diminishing returns in loss, but massive growth in optimal model size: the power law at work.

To connect the simulation to real systems, here are some landmarks on the compute axis:

10^-5 PF-days

Tiny experiment, ~1 GPU-minute. Good for unit tests and debugging.

↓

10^-2 PF-days

Small research run, ~1 GPU-day. Enough to train a ~100M parameter model.

↓

10¹ PF-days

Serious training, ~1000 GPU-days. GPT-2 scale. Multi-billion parameter models.

↓

10⁴ PF-days

Frontier training, millions of GPU-hours. GPT-4 / Claude scale. Beyond the paper's data.
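
The GPU-time figures in these landmarks follow from a simple unit conversion. The sketch below assumes a single GPU sustaining 10 TFLOP/s, which is the rate implied by "10^-2 PF-days is about 1 GPU-day"; real hardware throughput varies widely, and the constant and function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
FLOPS_PER_PF_DAY = 1e15 * SECONDS_PER_DAY  # 1 PF-day = 8.64e19 FLOPs

# Assumption: one GPU sustains 10 TFLOP/s (implied by the landmarks above).
ASSUMED_GPU_FLOPS_PER_SEC = 1e13

def pf_days_to_gpu_days(pf_days: float,
                        gpu_flops_per_sec: float = ASSUMED_GPU_FLOPS_PER_SEC) -> float:
    """Convert a compute budget in PF-days to GPU-days at a given sustained rate."""
    total_flops = pf_days * FLOPS_PER_PF_DAY
    return total_flops / (gpu_flops_per_sec * SECONDS_PER_DAY)

for exponent in (-5, -2, 1, 4):
    gpu_days = pf_days_to_gpu_days(10.0 ** exponent)
    print(f"10^{exponent} PF-days = {gpu_days:g} GPU-days ({gpu_days * 24:g} GPU-hours)")
```

Changing the assumed per-GPU rate shifts every landmark by the same factor, so the relative spacing on the compute axis is unaffected.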

The beauty of the scaling law framework is that even though this paper only measured up to ~10² PF-days, the straight-line extrapolation to 10⁴ has proven remarkably accurate. The power laws predicted GPT-3's performance before it was trained. They are the closest thing to a crystal ball that deep learning has produced.

Looking at the explorer, when you increase compute by 10x, by roughly how much does the optimal model size increase?

About 5x (N ∝ C^0.73)

About 10x (linear)

About 2x (slow growth)

Chapter 8: Legacy and Chinchilla

This paper's influence cannot be overstated. It established scaling as a first-class research direction in machine learning — not just "let's make things bigger and hope," but "here are the exact equations that predict how much better things get as you scale." Before Kaplan et al., researchers focused on architectural innovations — better attention mechanisms, better normalization, clever training tricks. After this paper, the field realized that simply making things bigger, if done correctly, was often more effective than any architectural tweak.

The paper's methodology also set a standard. The careful controlled experiments — varying one factor at a time, holding others fixed, plotting on log-log axes, fitting power laws — became the template for all subsequent scaling studies. The field of "scaling science" or "scaling research" essentially began here.

GPT-3, announced just four months later (May 2020), was built directly on these scaling laws. Its 175 billion parameters and the decision to train on "only" 300 billion tokens (stopping well before convergence on that data) were informed by the predictions in this paper. Several of the paper's authors, including Tom Brown, Alec Radford, Rewon Child, Scott Gray, Jeff Wu, Benjamin Chess, and Dario Amodei, went on to co-author the GPT-3 paper. The scaling laws literally told them how big to build it.

GPT-3's remarkable few-shot capabilities validated the approach: scale up, and emergent abilities appear. Tasks that were impossible at 1B parameters became possible at 175B, not through any architectural innovation, but through pure scale. The scaling laws had predicted the loss correctly; what they hadn't predicted was that lower loss would unlock qualitatively new capabilities like in-context learning.

The Chinchilla correction (2022). Two years later, Hoffmann et al. at DeepMind revisited the compute-optimal allocation question. They found that Kaplan et al. had under-estimated the importance of data. Where Kaplan predicted N ∝ C^0.73 and D ∝ C^0.27, Chinchilla found roughly N ∝ C^0.50 and D ∝ C^0.50 — parameters and data should scale equally. This meant GPT-3 was over-sized and under-trained.
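
A small difference in exponent compounds quickly. The sketch below assumes, purely for illustration, that both recipes recommend the same model size at some starting budget, and then measures how far apart the recommendations drift as compute grows. The function name `size_gap` is hypothetical.

```python
KAPLAN_N_EXP = 0.73      # N_opt ∝ C^0.73 (Kaplan et al.)
CHINCHILLA_N_EXP = 0.50  # N_opt ∝ C^0.50 (Hoffmann et al., approximate)

def size_gap(compute_ratio: float) -> float:
    """Factor by which Kaplan's recommended model exceeds Chinchilla's after
    scaling compute by compute_ratio from a shared (assumed) starting point."""
    return compute_ratio ** (KAPLAN_N_EXP - CHINCHILLA_N_EXP)

for ratio in (10, 1_000, 1_000_000):
    print(f"{ratio:>9,}x compute -> Kaplan model is {size_gap(ratio):.1f}x larger")
```

The gap grows as C^0.23, so the two recipes look similar over one order of magnitude and very different over six.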

Chinchilla's key result: a 70B parameter model trained on 1.4 trillion tokens outperforms the 280B parameter Gopher model trained on 300 billion tokens, despite using the same compute budget. The smaller model was better because it was trained on more data. Under the Kaplan recipe, Gopher was over-sized and under-trained.
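
The "same compute budget" claim can be sanity-checked with the common approximation C ≈ 6·N·D for training FLOPs. That rule of thumb is not stated in this section and is used here as an assumption; the parameter and token counts are the ones quoted above.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the rule of thumb C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens

models = {
    "Gopher":     (280e9, 300e9),   # (parameters, training tokens) from the text
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n, d) in models.items():
    flops = training_flops(n, d)
    print(f"{name:<10} C ≈ {flops:.2e} FLOPs, {d / n:.1f} tokens per parameter")
```

Under this approximation the two budgets land within about 20% of each other, while the tokens-per-parameter ratio differs by more than an order of magnitude.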

Why the discrepancy? Several factors:

Learning rate schedule

Kaplan used a fixed schedule for all runs. Chinchilla tuned the schedule per run, discovering that longer training with proper scheduling yields more gains.

↓

Last token vs. average

Small differences in how loss is measured (averaged over context vs. last token) can shift the optimal allocation.

↓

Scale range

Chinchilla experiments extended to larger scales, where data efficiency may change.

But here is what matters most: the framework is the same. Both papers agree that scaling laws exist, that they are power laws, and that they can guide compute allocation. The disagreement is only about the precise exponents (how fast N should grow versus D), not the paradigm itself. Kaplan et al. built the framework. Chinchilla refined the numbers.

The practical consequence of Chinchilla was immediate. LLaMA (Meta, 2023) took the lesson to heart: its largest model has 65B parameters trained on 1.4 trillion tokens (roughly 20 tokens per parameter, close to Chinchilla's ratio), rather than 175B parameters trained on 300 billion tokens like GPT-3. Its smaller 7B and 13B models were trained on 1 trillion tokens each, deliberately past the compute-optimal point, because smaller models are cheaper to serve. LLaMA matched or exceeded GPT-3 performance at a fraction of the inference cost.

This highlights a distinction the original Kaplan paper did not emphasize: training compute vs. inference compute. The Kaplan recipe optimizes for training efficiency, meaning the best loss per training FLOP. But once trained, a large model costs more to run on every query. If you plan to serve millions of users, the inference cost dominates, and you want the smallest model that achieves your target loss. Chinchilla's "train on more data, use fewer parameters" approach better serves this goal.
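
A rough way to see the trade-off is to add serving cost to training cost. The sketch below assumes about 6·N·D FLOPs for training and about 2·N FLOPs per token at inference, both common approximations rather than figures from this section. The two model configurations are hypothetical stand-ins, assumed to reach the same target loss.

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training cost (≈ 6*N*D) plus inference cost (≈ 2*N per served token)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens

# Hypothetical pair, assumed to reach the same target loss.
big = dict(n_params=175e9, train_tokens=300e9)
small = dict(n_params=65e9, train_tokens=1.4e12)

for served in (0.0, 1e12, 1e13):
    big_total = lifetime_flops(served_tokens=served, **big)
    small_total = lifetime_flops(served_tokens=served, **small)
    cheaper = "small" if small_total < big_total else "big"
    print(f"served={served:.0e} tokens: big={big_total:.2e}, "
          f"small={small_total:.2e} -> {cheaper} is cheaper")
```

With these assumed numbers the larger model is cheaper if it is never deployed, and the smaller one wins once enough tokens have been served.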

Today, the scaling laws paradigm has expanded far beyond language. Similar power laws have been found for vision models (ViT), multimodal models (CLIP, Flamingo), code generation (Codex), protein folding (AlphaFold), and even scientific simulation. The principle that performance is a smooth, predictable function of scale has become one of the most reliable empirical facts in deep learning.

More recent work has extended scaling laws to include inference-time compute (how much thinking the model does at test time, as in chain-of-thought or tree search), post-training (RLHF, instruction tuning, where performance also scales predictably with reward model size and preference data), and mixture-of-experts architectures (where active parameters and total parameters follow different scaling curves). The framework keeps generalizing. Kaplan et al. gave us the grammar; the field is still writing sentences with it.

Perhaps the deepest legacy is cultural. Before this paper, the AI research community valued clever ideas above all else — a new attention variant, a better loss function, a smarter data augmentation trick. After this paper, the community also valued scale as a legitimate research strategy.

This shift led directly to the current era of frontier models, where the binding constraint is not ideas but compute, data, and engineering. Whether this is a good thing is debated — it has certainly raised the barrier to entry for academic researchers who lack access to massive GPU clusters. But the empirical reality is undeniable: the scaling laws work. Models built according to their predictions consistently outperform models designed with clever tricks but insufficient scale.

One final thought. The scaling laws tell us that loss improves predictably with scale. But they do not tell us what capabilities emerge at each loss level. The jump from "can complete sentences" to "can write essays" to "can reason about code" happens at specific loss thresholds that the scaling laws do not predict. Understanding the relationship between loss and capabilities remains one of the most important open problems in AI research — a frontier that Kaplan et al.'s framework points toward but does not cross.

How did the Chinchilla paper (2022) revise Kaplan et al.'s compute-optimal recipe?

It said scaling laws don't exist

It found that data should scale equally with parameters, not much slower

It found that smaller models are always better

Chapter 9: Connections

This paper sits at a critical junction in the history of AI scaling. It took an empirical observation — bigger models work better — and turned it into a quantitative science. Let's map its connections to the broader landscape.

Before this paper, "scaling up" was a heuristic, a belief, a bet. After it, scaling became a research program with quantitative predictions. You could write down an equation, plug in your compute budget, and know what loss to expect. This transformed AI from a field driven by architectural innovation to one driven (at least partially) by resource allocation.

The paper is also philosophically significant. It suggests that the details of neural network architecture matter far less than the brute fact of scale. Width, depth, attention heads — these are secondary. What matters is how many parameters, how much data, how much compute. This echoes Rich Sutton's "Bitter Lesson" (2019): in the long run, methods that leverage computation scale better than methods that leverage human knowledge about the domain.

It is worth noting that several authors of this paper went on to co-found Anthropic (Dario Amodei, Sam McCandlish, Jared Kaplan, Tom Brown, among others). The scaling laws framework was not just an academic exercise — it was foundational to the strategy of building increasingly capable AI systems. The paper's predictions helped guide billions of dollars of compute investment across the industry.

Builds on

Attention Is All You Need (Vaswani et al., 2017) introduced the Transformer architecture that these scaling laws characterize. The main experiments use decoder-only Transformers, with LSTMs and Universal Transformers trained for comparison.

GPT-2 (Radford et al., 2019) — the WebText dataset and training setup that provided the experimental foundation.

↓

Directly enabled

GPT-3 (Brown et al., 2020) — 175B parameters, explicitly designed using these scaling laws. Demonstrated emergent few-shot learning at scale.

Chinchilla (Hoffmann et al., 2022) — refined the compute-optimal allocation, leading to better-trained models like LLaMA.

↓

Broader impact

Scaling laws for other domains — vision (ViT scaling), multimodal (CLIP), code (Codex), science (AlphaFold). The framework generalizes.

The "Bitter Lesson" (Rich Sutton, 2019) is the philosophical precursor. Sutton argued that, in the long run, general methods that leverage computation win out. This paper supplied quantitative evidence for that view.

A summary of the paper's key equations, for reference:

L(N) = (N_c/N)^α_N

α_N ≈ 0.076, N_c ≈ 8.8 × 10¹³

↓

L(D) = (D_c/D)^α_D

α_D ≈ 0.095, D_c ≈ 5.4 × 10¹³

↓

L(C_min) = (C_c/C_min)^α_C

α_C ≈ 0.050, C_c ≈ 3.1 × 10⁸ PF-days

↓

L(N, D) = [(N_c/N)^(α_N/α_D) + D_c/D]^α_D

The unified equation governing overfitting

↓

N_opt ∝ C^0.73, D_opt ∝ C^0.27

Compute-optimal allocation (revised by Chinchilla to ~C^0.5 each)
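
For readers who want to plug in numbers, the equations above translate directly into a few lines of Python. This is a sketch using the constants exactly as listed in the summary; the function names are illustrative, and the fits should not be trusted far outside the range the paper measured.

```python
import math

# Fitted constants as listed in the summary above.
ALPHA_N, N_C = 0.076, 8.8e13   # N in non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # D in tokens
ALPHA_C, C_C = 0.050, 3.1e8    # C_min in PF-days

def loss_from_params(n: float) -> float:
    """L(N): loss when data and compute are not the bottleneck."""
    return (N_C / n) ** ALPHA_N

def loss_from_data(d: float) -> float:
    """L(D): loss when model size is not the bottleneck."""
    return (D_C / d) ** ALPHA_D

def loss_from_compute(c_min: float) -> float:
    """L(C_min): loss along the compute-optimal frontier."""
    return (C_C / c_min) ** ALPHA_C

def loss_from_params_and_data(n: float, d: float) -> float:
    """L(N, D): the unified equation that captures overfitting."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

n = 1.5e9  # roughly GPT-2 scale
print(f"L(N)          = {loss_from_params(n):.3f} nats/token")
print(f"L(N, D=2e10)  = {loss_from_params_and_data(n, 2e10):.3f} nats/token")
# With effectively unlimited data, L(N, D) reduces to L(N).
print(math.isclose(loss_from_params_and_data(n, 1e30), loss_from_params(n)))
```

The last line checks the limiting behavior: as D grows without bound, the D_c/D term vanishes and the unified equation collapses to L(N).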

Paper details. "Scaling Laws for Neural Language Models," Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei. OpenAI / Johns Hopkins University, January 2020. arXiv:2001.08361.

What paper directly refined the compute-optimal training recipe proposed by Kaplan et al.?

GPT-3

Chinchilla (Hoffmann et al., 2022)

BERT
