This note is about what happens as models get bigger. Scaling laws are the empirical finding that loss falls along a smooth, predictable power-law curve as you add parameters, data, and compute — which is why labs keep scaling. It covers the Kaplan and Chinchilla laws and their compute-optimal tradeoff (about 20 tokens per parameter), then emergence: the claim that some abilities appear suddenly at scale, and the live debate over whether those jumps are real or artifacts of measurement. It builds directly on pretraining (Foundation 03) — scaling laws govern how much of it to do — and excludes the architecture (Foundation 02) and inference (Foundation 05).
Pretraining works, but it is astronomically expensive — so before committing hundreds of millions of dollars and months of compute, a lab wants to know two things. If I make the model twice as big, or train on twice as much data, how much better will it get? And given a fixed compute budget, should I spend it on a bigger model or on more data? Guessing wrong wastes a fortune.
Scaling laws answer the first question — performance turns out to be surprisingly predictable — and compute-optimal scaling answers the second. Then comes a stranger question: as models grow, do they simply get smoothly better, or do genuinely new abilities switch on at particular sizes?
The foundational result, from Kaplan and colleagues at OpenAI in 2020, is that a model's test loss falls along smooth power-law curves as you increase parameters, dataset size, or compute.[1] On a log-log plot those curves are nearly straight lines, which means loss is forecastable — you can estimate how a larger run will perform before paying for it. Their early reading also leaned toward size: make models as large as possible and train on relatively little data.[1]
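A straight line on a log-log plot is what makes the forecast possible: fit the line on small runs, then extend it. The sketch below is illustrative only. The data points are synthetic, the constants `true_a` and `true_alpha` are made-up values rather than figures from Kaplan et al., and the pure power-law form ignores the irreducible-loss offset that practical fits often include.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-alpha) by least squares on log-log axes.

    Taking logs turns the power law into a line:
    log(loss) = log(a) - alpha * log(size).
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)

def predict_loss(size, a, alpha):
    """Extrapolate the fitted curve to a model size that was not trained."""
    return a * size ** (-alpha)

# Synthetic "small runs" (hypothetical values, chosen only for illustration).
true_a, true_alpha = 10.0, 0.08
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = true_a * sizes ** (-true_alpha)

a, alpha = fit_power_law(sizes, losses)
forecast = predict_loss(1e10, a, alpha)  # a run ten times larger than any fitted
print(f"fitted alpha = {alpha:.3f}, forecast loss at 1e10 params = {forecast:.3f}")
```

With real measurements the points scatter around the line, so the forecast carries uncertainty that grows the further it is extended beyond the fitted range.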
For transformers, compute is roughly C ≈ 6ND (N parameters, D tokens), so a fixed budget forces a split between model size and data.[4] In 2022, DeepMind's Chinchilla study found that, contrary to the size-first view, parameters and training tokens should grow in roughly equal proportion as compute increases, which works out to about 20 tokens per parameter.[3][5] Too small a model on plenty of data underfits; too large a model on too little data leaves much of its capacity undertrained. The optimum balances the two.[4]
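Combining C ≈ 6ND with D ≈ 20N gives a closed-form split: N ≈ sqrt(C / 120) and D ≈ 20N. A minimal sketch of that arithmetic follows. The function name `compute_optimal_split` and the example budget are hypothetical, and both constants are the rules of thumb quoted above rather than exact values.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # from the approximation C = 6 * N * D
TOKENS_PER_PARAM = 20.0      # Chinchilla-style rule of thumb

def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Return (params, tokens) that spend compute_flops at the given ratio.

    Substituting D = r * N into C = 6 * N * D gives C = 6 * r * N**2,
    so N = sqrt(C / (6 * r)).
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# Hypothetical budget, chosen only to show the calculation.
budget = 1e23
n, d = compute_optimal_split(budget)
print(f"params = {n:.3e}, tokens = {d:.3e}")
```

Because both quantities grow with the square root of compute, a budget four times larger doubles the model and doubles the data.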
The headline demonstration: a 70-billion-parameter Chinchilla, trained on far more data, beat the 280-billion-parameter Gopher at the same compute budget.[5] By the 20:1 rule, GPT-3 (175B parameters on roughly 300B tokens) had been badly undertrained: its 300B tokens would have suited a model of only about 15B parameters, while its 175B parameters called for around 3.5 trillion tokens.[3] The practical payoff is large: a smaller, well-fed model is not just competitive but cheaper to run at inference.
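The GPT-3 figures can be checked with a few lines of arithmetic. The first two readings reproduce the numbers in the text. The third, holding GPT-3's compute fixed, is not stated in the source; it is an assumption-laden extension that follows from C ≈ 6ND and the 20:1 rule of thumb.

```python
import math

TOKENS_PER_PARAM = 20.0  # rule of thumb, not an exact constant

gpt3_params = 175e9
gpt3_tokens = 300e9  # "roughly 300B" in the text

# How far from 20:1 the actual run was.
actual_ratio = gpt3_tokens / gpt3_params

# Reading 1: keep the 175B model, ask how many tokens it needed.
tokens_needed = TOKENS_PER_PARAM * gpt3_params

# Reading 2: keep the 300B tokens, ask what model size they suited.
params_matched = gpt3_tokens / TOKENS_PER_PARAM

# Reading 3 (extension, assumes C = 6 * N * D): keep the compute fixed.
gpt3_compute = 6.0 * gpt3_params * gpt3_tokens
same_compute_params = math.sqrt(gpt3_compute / (6.0 * TOKENS_PER_PARAM))
same_compute_tokens = TOKENS_PER_PARAM * same_compute_params

print(f"actual tokens per parameter: {actual_ratio:.2f}")
print(f"tokens needed at 175B params: {tokens_needed:.2e}")
print(f"params matched to 300B tokens: {params_matched:.2e}")
print(f"same compute: {same_compute_params:.2e} params, {same_compute_tokens:.2e} tokens")
```

The 15B and 3.5-trillion figures each fix one side of the ratio, so they imply very different compute budgets; the third reading lands between them, at roughly 51B parameters and 1 trillion tokens.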
Kaplan's size-first conclusion and Chinchilla's balance looked contradictory (N ∝ C^0.73 versus N ∝ C^0.50), but later work traced most of the gap to Kaplan counting only non-embedding parameters at small scale; the field now broadly follows Chinchilla-style balance.[2] The durable lesson is that pretraining outcomes are largely predictable from scale, which is precisely why the frontier keeps advancing by spending more.
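To see why the two exponents matter in practice, consider what each prescribes when compute grows 100-fold. This is a rough sketch using the exponents 0.73 and 0.50 quoted above; the 100x multiplier is an arbitrary illustration, and the data growth is derived by assuming C ≈ 6ND so that tokens absorb whatever compute the model does not.

```python
def growth_factor(compute_multiplier, exponent):
    """How much a quantity grows if it scales as compute**exponent."""
    return compute_multiplier ** exponent

compute_multiplier = 100.0  # arbitrary illustrative increase in compute

# Model-size growth under each exponent.
kaplan_params = growth_factor(compute_multiplier, 0.73)
chinchilla_params = growth_factor(compute_multiplier, 0.50)

# With C proportional to N * D, data growth is the remaining factor.
kaplan_tokens = compute_multiplier / kaplan_params
chinchilla_tokens = compute_multiplier / chinchilla_params

print(f"Kaplan-style:     params x{kaplan_params:.1f}, tokens x{kaplan_tokens:.1f}")
print(f"Chinchilla-style: params x{chinchilla_params:.1f}, tokens x{chinchilla_tokens:.1f}")
```

Under the size-first exponent the model grows almost 29 times while data grows only about 3.5 times; under the balanced exponent both grow 10 times.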
Alongside smooth scaling, Wei and colleagues catalogued what they called emergent abilities — capabilities that appear only once a model is large enough, and that cannot be reliably extrapolated from smaller models.[7] Such abilities looked both sharp (off, then suddenly on) and unpredictable (no warning at what size they'd appear).[6]
Schaeffer and colleagues pushed back: many of those sudden jumps may be artifacts of how performance is measured, not of the model itself.[6] Under a discontinuous metric like exact-match, a model that is quietly, smoothly improving will appear to do nothing until it crosses a threshold and then leaps — and most claimed emergent abilities on one benchmark appeared under just such metrics, with the curves smoothing out under continuous ones.[6] The classic example is multi-digit addition: per-digit accuracy climbs gradually, but exact-match on the whole number stays near zero until nearly every digit is right, then jumps.[9] The debate isn't fully settled — a few tasks show sharp jumps even under continuous metrics — but the field now treats “emergence” as a claim to scrutinize, not a given.[8]
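The addition example can be reproduced with a toy calculation. The sketch below assumes each digit is right or wrong independently, which is a simplification made here for illustration and not a claim from the cited work; the per-digit accuracies and the 10-digit answer length are likewise hypothetical.

```python
def exact_match_accuracy(per_digit_accuracy, num_digits):
    """Chance that every digit is correct, assuming independent digit errors."""
    return per_digit_accuracy ** num_digits

# Hypothetical, smoothly improving per-digit accuracy across model sizes.
per_digit = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
num_digits = 10  # hypothetical answer length

rows = [(p, exact_match_accuracy(p, num_digits)) for p in per_digit]
for p, em in rows:
    print(f"per-digit {p:.2f} -> exact-match {em:.3f}")
```

The continuous metric rises in even steps, yet exact match stays near zero through 0.70, reaches only about 0.11 at 0.80, and then climbs to about 0.90 at 0.99: the same underlying progress, read through a metric that rewards only a fully correct answer.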
Scaling laws turn pretraining from a gamble into a budgeting problem: pick a compute budget, split it between parameters and tokens in the right ratio, and the loss you'll reach is largely foreseeable.
The throughline: more compute, allocated well, reliably buys lower loss — that predictability is what lets labs keep pushing the frontier. The emergence question feeds straight into evaluation: if the choice of metric can manufacture a “sudden ability,” then how you measure decides what you see — the same lesson as the golden-dataset note one tier up. And the compute-optimal insight that a smaller, data-rich model can win is part of what makes efficient inference (Foundation 05) economically possible, since that smaller model is cheaper to serve.
Scaling laws describe loss, not usefulness. They are empirical regularities, not guarantees — a lower loss does not automatically mean a more truthful, aligned, or safe model, and the curves can bend when high-quality data runs short. Predictable is not the same as unlimited.
“Emergence” is a contested word. Depending on the metric you pick, the same smooth underlying progress can look like a sudden leap or a gentle slope. Treat dramatic “ability switched on at size X” plots with care, and always ask what was being measured.