Scaling Laws & Emergence: Why Bigger Gets Better — and Whether New Abilities Appear

TL;DR — Scope

This note is about what happens as models get bigger. Scaling laws are the empirical finding that loss falls along a smooth, predictable power-law curve as you add parameters, data, and compute — which is why labs keep scaling. It covers the Kaplan and Chinchilla laws and their compute-optimal tradeoff (about 20 tokens per parameter), then emergence: the claim that some abilities appear suddenly at scale, and the live debate over whether those jumps are real or artifacts of measurement. It builds directly on pretraining (Foundation 03) — scaling laws govern how much of it to do — and excludes the architecture (Foundation 02) and inference (Foundation 05).

The Problem

Pretraining works, but it is astronomically expensive — so before committing hundreds of millions of dollars and months of compute, a lab wants to know two things. If I make the model twice as big, or train on twice as much data, how much better will it get? And given a fixed compute budget, should I spend it on a bigger model or on more data? Guessing wrong wastes a fortune.

Scaling laws answer the first question — performance turns out to be surprisingly predictable — and compute-optimal scaling answers the second. Then comes a stranger question: as models grow, do they simply get smoothly better, or do genuinely new abilities switch on at particular sizes?

FIG. 1 — The power law. Plotted on log-log axes, test loss falls in a nearly straight line as compute grows, so a bigger run's performance can be predicted in advance. Schematic.

The Concepts

Scaling Laws — the Power Law

The foundational result, from Kaplan and colleagues at OpenAI in 2020, is that a model's test loss falls along smooth power-law curves as you increase parameters, dataset size, or compute.^[1] On a log-log plot those curves are nearly straight lines, which means loss is forecastable — you can estimate how a larger run will perform before paying for it. Their early reading also leaned toward size: make models as large as possible and train on relatively little data.^[1]
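The forecasting idea can be sketched in a few lines: fit a straight line to small runs in log-log space, then extrapolate to a larger budget. This is a minimal illustration rather than Kaplan's actual procedure. The data points are synthetic, the names `fit_power_law` and `predict_loss` are hypothetical, and the pure power-law form (with no irreducible-loss floor) is a simplifying assumption.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) + b * log(compute).

    Returns (a, b) so that loss is approximated by a * compute ** b.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope of the line on log-log axes
    a = math.exp(mean_y - b * mean_x)   # intercept, mapped back from log space
    return a, b

def predict_loss(a, b, compute):
    """Extrapolate the fitted power law to a new compute budget."""
    return a * compute ** b

# Synthetic "small runs" (assumed values, generated from loss = 8 * C^-0.05).
small_runs_compute = [1e17, 1e18, 1e19, 1e20]
small_runs_loss = [8.0 * c ** -0.05 for c in small_runs_compute]

a, b = fit_power_law(small_runs_compute, small_runs_loss)
forecast = predict_loss(a, b, 1e22)     # a run 100x larger than any fitted point

print(f"fitted exponent b = {b:.3f}")
print(f"forecast loss at 1e22 FLOPs = {forecast:.3f}")
```

Because the synthetic data lie exactly on a power law, the fit recovers the exponent and the extrapolation is exact. Real runs are noisy, so the forecast would carry uncertainty that grows with the extrapolation distance.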

The Compute-Optimal Tradeoff

For transformers, training compute is roughly C ≈ 6ND FLOPs (N parameters, D training tokens), so a fixed budget forces a split between model size and data.^[4] In 2022, DeepMind's Chinchilla study found that, contrary to the size-first view, parameters and training tokens should grow in roughly equal proportion as compute increases, which works out to about 20 tokens per parameter.^[3]^[5] Too small a model on plenty of data underfits; too large a model on too little data leaves most of its capacity undertrained. The optimum balances the two.^[4]
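Combining C ≈ 6ND with the 20-tokens-per-parameter rule gives a closed-form split: substituting D = 20N yields N = sqrt(C / 120). The sketch below applies that arithmetic. The function name `compute_optimal_split` and the example budget are hypothetical, and both constants are rules of thumb rather than exact values.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # from the approximation C ≈ 6 * N * D
TOKENS_PER_PARAM = 20       # Chinchilla-style rule of thumb

def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens) at a fixed tokens/param ratio.

    With C = 6 * N * D and D = r * N, solving for N gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = 1e24  # hypothetical training budget in FLOPs
n, d = compute_optimal_split(budget)
print(f"parameters: {n / 1e9:.0f}B, tokens: {d / 1e12:.2f}T")
```

For this assumed budget the rule suggests roughly 91B parameters trained on about 1.8T tokens. Changing `tokens_per_param` shows how sensitive the split is to the ratio chosen.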

FIG. 2 — The compute-optimal balance. For a fixed budget, both an over-small model (underfit) and an over-large model starved of data are sub-optimal; loss bottoms out where parameters and tokens are balanced — roughly 20 tokens per parameter. Schematic.

The Undertraining Correction

The headline demonstration: a 70-billion-parameter Chinchilla, trained on far more data, beat the 280-billion-parameter Gopher at the same compute budget.^[5] By the 20:1 rule, GPT-3 (175B parameters on roughly 300B tokens) had been badly undertrained: its 300B tokens would suit a model of about 15B parameters, while its 175B parameters would call for around 3.5 trillion tokens.^[3] The practical payoff is large: a smaller, well-fed model is not just competitive but also cheaper to run at inference.
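The GPT-3 figures follow from simple arithmetic on the 20:1 rule, reproduced below. The last step, which holds GPT-3's compute fixed and re-splits it using C ≈ 6ND, is an additional illustration under the same rule-of-thumb assumptions and is not a figure taken from the cited sources.

```python
import math

TOKENS_PER_PARAM = 20        # rule-of-thumb ratio

gpt3_params = 175e9          # parameters
gpt3_tokens = 300e9          # training tokens (approximate)

# How far from 20:1 was the actual run?
actual_ratio = gpt3_tokens / gpt3_params

# Reading 1: keep the data fixed, ask what model size it supports.
params_for_its_data = gpt3_tokens / TOKENS_PER_PARAM

# Reading 2: keep the model fixed, ask how much data it needs.
tokens_for_its_size = gpt3_params * TOKENS_PER_PARAM

# Reading 3 (assumption): keep compute fixed at C = 6 * N * D and re-split at 20:1.
gpt3_compute = 6 * gpt3_params * gpt3_tokens
matched_params = math.sqrt(gpt3_compute / (6 * TOKENS_PER_PARAM))
matched_tokens = TOKENS_PER_PARAM * matched_params

print(f"actual tokens per parameter: {actual_ratio:.1f}")
print(f"params suited to 300B tokens: {params_for_its_data / 1e9:.0f}B")
print(f"tokens suited to 175B params: {tokens_for_its_size / 1e12:.1f}T")
print(f"same-compute split: {matched_params / 1e9:.0f}B params, "
      f"{matched_tokens / 1e12:.2f}T tokens")
```

The first two readings recover the 15B-parameter and 3.5-trillion-token figures. The same-compute reading lands in between, at roughly 51B parameters on about 1T tokens, which shows that "undertrained" can be quantified in more than one way.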

FIG. 3 — Smaller, better fed, better. At equal compute, Chinchilla's quarter-size model trained on far more data outperformed Gopher — the demonstration that the field had been undertraining. Approximate figures.

Reconciliation & Upshot

Kaplan's size-first conclusion and Chinchilla's balance looked contradictory (optimal model size growing as N ∝ C^0.73 versus N ∝ C^0.50), but later work traced most of the gap to Kaplan counting only non-embedding parameters at small scale; the field now broadly follows Chinchilla-style balance.^[2] The durable lesson is that pretraining outcomes are largely predictable from scale, which is precisely why the frontier keeps advancing by spending more.
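To see what the two exponents mean in practice, the sketch below asks how a 10x compute increase would be divided under each. It assumes C ∝ N·D, so the data exponent is one minus the model-size exponent. The function name `growth_factors` is hypothetical.

```python
KAPLAN_EXPONENT = 0.73       # N ∝ C^0.73
CHINCHILLA_EXPONENT = 0.50   # N ∝ C^0.50

def growth_factors(compute_multiplier, n_exponent):
    """Return (model growth, data growth) for a given compute increase.

    Assumes C ∝ N * D, so whatever growth does not go to N goes to D.
    """
    n_growth = compute_multiplier ** n_exponent
    d_growth = compute_multiplier / n_growth
    return n_growth, d_growth

kaplan_n, kaplan_d = growth_factors(10, KAPLAN_EXPONENT)
chinchilla_n, chinchilla_d = growth_factors(10, CHINCHILLA_EXPONENT)

print(f"Kaplan:     model x{kaplan_n:.2f}, data x{kaplan_d:.2f}")
print(f"Chinchilla: model x{chinchilla_n:.2f}, data x{chinchilla_d:.2f}")
```

Under the Kaplan exponent, 10x more compute goes mostly to a model about 5.4x larger with less than 2x the data. Under the Chinchilla exponent, model and data each grow by about 3.2x.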

Emergence

Alongside smooth scaling, Wei and colleagues catalogued what they called emergent abilities — capabilities that appear only once a model is large enough, and that cannot be reliably extrapolated from smaller models.^[7] Such abilities looked both sharp (off, then suddenly on) and unpredictable (no warning at what size they'd appear).^[6]

The Mirage Debate

Schaeffer and colleagues pushed back: many of those sudden jumps may be artifacts of how performance is measured, not of the model itself.^[6] Under a discontinuous metric like exact-match, a model that is quietly, smoothly improving will appear to do nothing until it crosses a threshold and then leaps — and most claimed emergent abilities on one benchmark appeared under just such metrics, with the curves smoothing out under continuous ones.^[6] The classic example is multi-digit addition: per-digit accuracy climbs gradually, but exact-match on the whole number stays near zero until nearly every digit is right, then jumps.^[9] The debate isn't fully settled — a few tasks show sharp jumps even under continuous metrics — but the field now treats “emergence” as a claim to scrutinize, not a given.^[8]
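The addition example can be reproduced with a toy calculation. If each digit is correct with probability p, and digits are treated as independent (a simplifying assumption), exact-match accuracy on a k-digit answer is p^k. The per-digit values and the 10-digit answer length below are illustrative, not measured results.

```python
def exact_match_accuracy(per_digit_accuracy, num_digits):
    """Probability that every digit is right, assuming independent digits."""
    return per_digit_accuracy ** num_digits

# A smoothly improving per-digit accuracy, standing in for growing model scale.
per_digit_curve = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
NUM_DIGITS = 10  # assumed answer length

for p in per_digit_curve:
    em = exact_match_accuracy(p, NUM_DIGITS)
    print(f"per-digit {p:.2f} -> exact-match {em:.3f}")
```

Per-digit accuracy rises steadily across the list, yet exact-match stays near zero through the early steps and climbs steeply only at the end. The same underlying progress produces a gentle slope under one metric and an apparent cliff under the other.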

FIG. 4 — Real ability or measurement artifact? The identical underlying progress can look like a cliff under a pass/fail metric and a gentle slope under a continuous one. The metric, not just the model, shapes the story. Schematic.

How It All Fits Together

Scaling laws turn pretraining from a gamble into a budgeting problem: pick a compute budget, split it between parameters and tokens in the right ratio, and the loss you'll reach is largely foreseeable.

Budget: Compute (FLOPs you can afford) → Scaling law: Allocate N : D (≈ 1 : 20, about 20 tokens/param) → Run: Pretrain (Foundation 03) → Result: Loss ↓ (predictable) + capabilities

FIG. 5 — Scaling as budgeting. The scaling law sits on top of pretraining, deciding how to spend compute; the resulting loss is forecastable, and capabilities follow.

The throughline: more compute, allocated well, reliably buys lower loss — that predictability is what lets labs keep pushing the frontier. The emergence question feeds straight into evaluation: if the choice of metric can manufacture a “sudden ability,” then how you measure decides what you see — the same lesson as the golden-dataset note one tier up. And the compute-optimal insight that a smaller, data-rich model can win is part of what makes efficient inference (Foundation 05) economically possible, since that smaller model is cheaper to serve.

① Scaling laws describe loss, not usefulness. They are empirical regularities, not guarantees: a lower loss does not automatically mean a more truthful, aligned, or safe model, and the curves can bend when high-quality data runs short. Predictable is not the same as unlimited.

② “Emergence” is a contested word. Depending on the metric you pick, the same smooth underlying progress can look like a sudden leap or a gentle slope. Treat dramatic “ability switched on at size X” plots with care, and always ask what was being measured.
