Research Note  ·  Foundations / Scaling

Scaling Laws & Emergence: Why Bigger Gets Better, and Whether New Abilities Really Appear

Foundations · 04 of the LLM Foundations series · Tier 1
TL;DR — Scope

This note is about what happens as models get bigger. Scaling laws are the empirical finding that loss falls along a smooth, predictable power-law curve as you add parameters, data, and compute — which is why labs keep scaling. It covers the Kaplan and Chinchilla laws and their compute-optimal tradeoff (about 20 tokens per parameter), then emergence: the claim that some abilities appear suddenly at scale, and the live debate over whether those jumps are real or artifacts of measurement. It builds directly on pretraining (Foundation 03) — scaling laws govern how much of it to do — and excludes the architecture (Foundation 02) and inference (Foundation 05).

01

The Problem

Pretraining works, but it is astronomically expensive — so before committing hundreds of millions of dollars and months of compute, a lab wants to know two things. If I make the model twice as big, or train on twice as much data, how much better will it get? And given a fixed compute budget, should I spend it on a bigger model or on more data? Guessing wrong wastes a fortune.

Scaling laws answer the first question — performance turns out to be surprisingly predictable — and compute-optimal scaling answers the second. Then comes a stranger question: as models grow, do they simply get smoothly better, or do genuinely new abilities switch on at particular sizes?

[Figure 1 plot: test loss (log) against compute (log scale) for small, medium, and large runs. A straight line on log-log axes indicates a power law, so loss is forecastable before you pay.]
FIG. 1 — The power law. Plotted on log-log axes, test loss falls in a nearly straight line as compute grows, so a bigger run's performance can be predicted in advance. Schematic.
02

The Concepts

Scaling Laws — the Power Law

The foundational result, from Kaplan and colleagues at OpenAI in 2020, is that a model's test loss falls along smooth power-law curves as you increase parameters, dataset size, or compute.[1] On a log-log plot those curves are nearly straight lines, which means loss is forecastable — you can estimate how a larger run will perform before paying for it. Their early reading also leaned toward size: make models as large as possible and train on relatively little data.[1]
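To make "forecastable" concrete, here is a minimal sketch of how a power law can be fitted to a few small runs and extrapolated to a larger one. The data points, the constants 5.0 and -0.05, and the function names are invented for illustration and are not taken from Kaplan's paper. The sketch also assumes a pure power law with no irreducible-loss floor, which published fits often include.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) + b * log(compute).

    A straight line in log-log space corresponds to loss = a * compute ** b.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope of the log-log line
    a = math.exp(mean_y - b * mean_x)  # intercept, mapped back from log space
    return a, b

def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** b

# Synthetic "small runs" (hypothetical values, generated from a known curve).
small_runs_compute = [1e17, 1e18, 1e19, 1e20]
small_runs_loss = [5.0 * c ** -0.05 for c in small_runs_compute]

a, b = fit_power_law(small_runs_compute, small_runs_loss)
forecast = predict_loss(a, b, 1e22)  # extrapolate 100x beyond the largest run
print(f"fitted a = {a:.3f}, b = {b:.4f}, forecast loss at 1e22 FLOPs = {forecast:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating constants. Real measurements are noisy, so an extrapolation like this is an estimate with error bars rather than a guarantee.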

The Compute-Optimal Tradeoff

For transformers, compute is roughly C ≈ 6ND (N parameters, D tokens), so a fixed budget forces a split between model size and data.[4] In 2022, DeepMind's Chinchilla study found that — contrary to the size-first view — parameters and training tokens should scale roughly equally, about 20 tokens per parameter.[3][5] Too small a model on plenty of data underfits; too large a model on too little data is wasted. The optimum balances the two.[4]

[Figure 2 plot: loss against model size at fixed compute. Too small a model underfits; too big a model gets too few tokens; the compute-optimal point sits at ≈ 20 tokens per parameter.]
FIG. 2 — The compute-optimal balance. For a fixed budget, both an over-small model (underfit) and an over-large model starved of data are sub-optimal; loss bottoms out where parameters and tokens are balanced — roughly 20 tokens per parameter. Schematic.
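The split can be worked out directly from the two relations above. Substituting D = 20N into C ≈ 6ND gives C ≈ 120N², so N ≈ sqrt(C / 120). The sketch below applies that under the stated approximations; the function name and the example budget are illustrative choices, not from the cited sources.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # from the approximation C ≈ 6ND
TOKENS_PER_PARAM = 20       # Chinchilla-style rule of thumb

def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Return (parameters, tokens) that spend compute_flops at the given ratio.

    With D = r * N and C = 6 * N * D, we get N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Illustrative budget: the FLOPs implied by 70e9 parameters on 1.4e12 tokens.
budget = 6 * 70e9 * 1.4e12
n, d = compute_optimal_split(budget)
print(f"budget = {budget:.2e} FLOPs -> N ≈ {n:.3g} parameters, D ≈ {d:.3g} tokens")
```

Since the example budget was built from a 70B-parameter, 1.4T-token configuration, which already sits at 20 tokens per parameter, the function recovers that same split. Both the factor of 6 and the ratio of 20 are approximations, so the output is a starting point for planning rather than an exact optimum.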

The Undertraining Correction

The headline demonstration: a 70-billion-parameter Chinchilla, trained on far more data, beat the 280-billion-parameter Gopher at the same compute budget.[5] By the 20:1 rule, GPT-3 (175B parameters on roughly 300B tokens, under 2 tokens per parameter) had been badly undertrained: its 300B tokens would have suited a model of about 15B parameters, and its 175B parameters called for around 3.5 trillion tokens.[3] The practical payoff is large: a smaller, well-fed model is not just competitive but also cheaper to run at inference.

[Figure 3 chart: training tokens on a shared scale, same compute budget. Gopher: 280B params, ≈ 0.3T tokens. Chinchilla: 70B params, ≈ 1.4T tokens. ¼ the parameters, 4× the data, beats the larger model.]
FIG. 3 — Smaller, better fed, better. At equal compute, Chinchilla's quarter-size model trained on far more data outperformed Gopher — the demonstration that the field had been undertraining. Approximate figures.
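The arithmetic behind these comparisons is simple enough to check by hand. The sketch below uses the approximate parameter and token counts quoted above together with C ≈ 6ND; the variable names are illustrative, and the FLOP totals are rough estimates rather than reported training costs.

```python
def train_flops(params, tokens):
    """Approximate training compute via C ≈ 6ND."""
    return 6 * params * tokens

# Approximate (parameters, training tokens) as quoted in the text and Figure 3.
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (params, tokens) in models.items():
    ratio = tokens / params
    flops = train_flops(params, tokens)
    print(f"{name:<10} tokens/param = {ratio:5.1f}   C ≈ {flops:.2e} FLOPs")

# What the 20:1 rule implies for GPT-3, holding one quantity fixed at a time.
gpt3_params, gpt3_tokens = models["GPT-3"]
balanced_params_for_gpt3_tokens = gpt3_tokens / 20   # model size suited to 300B tokens
balanced_tokens_for_gpt3_params = gpt3_params * 20   # data suited to 175B parameters
print(f"{balanced_params_for_gpt3_tokens:.3g} parameters, or {balanced_tokens_for_gpt3_params:.3g} tokens")
```

Under these rounded figures Gopher and Chinchilla land within about 20% of each other in estimated compute, which is consistent with the "same budget" framing given how approximate the inputs are.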

Reconciliation & Upshot

Kaplan's size-first conclusion and Chinchilla's balance looked contradictory (N ∝ C^0.73 versus N ∝ C^0.50), but later work traced most of the gap to Kaplan counting only non-embedding parameters at small scale; the field now broadly follows Chinchilla-style balance.[2] The durable lesson is that pretraining outcomes are largely predictable from scale, which is precisely why the frontier keeps advancing by spending more.
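To see how far apart the two exponents are in practice, consider what each recommends for a tenfold increase in compute. The sketch below assumes C ≈ 6ND, so that if N grows as C raised to some exponent, D must grow as C raised to one minus that exponent. The labels and the tenfold multiplier are illustrative choices.

```python
def growth_factor(compute_multiplier, exponent):
    """Factor by which a quantity grows when it scales as C ** exponent."""
    return compute_multiplier ** exponent

COMPUTE_MULTIPLIER = 10.0  # illustrative: a 10x larger budget

results = {}
for label, n_exponent in [("size-first (0.73)", 0.73), ("balanced (0.50)", 0.50)]:
    d_exponent = 1.0 - n_exponent  # follows from C ≈ 6ND, i.e. D ∝ C / N
    n_growth = growth_factor(COMPUTE_MULTIPLIER, n_exponent)
    d_growth = growth_factor(COMPUTE_MULTIPLIER, d_exponent)
    results[label] = (n_growth, d_growth)
    print(f"{label:<18} parameters x{n_growth:.2f}, tokens x{d_growth:.2f}")
```

The size-first exponent puts most of the extra budget into parameters, while the balanced exponent grows parameters and tokens by the same factor.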

Emergence

Alongside smooth scaling, Wei and colleagues catalogued what they called emergent abilities — capabilities that appear only once a model is large enough, and that cannot be reliably extrapolated from smaller models.[7] Such abilities looked both sharp (off, then suddenly on) and unpredictable (no warning at what size they'd appear).[6]

The Mirage Debate

Schaeffer and colleagues pushed back: many of those sudden jumps may be artifacts of how performance is measured, not of the model itself.[6] Under a discontinuous metric like exact-match, a model that is quietly, smoothly improving will appear to do nothing until it crosses a threshold and then leaps. Most claimed emergent abilities on one benchmark (BIG-Bench, in Schaeffer et al.'s analysis) appeared under just such metrics, and the curves smoothed out under continuous ones.[6] The classic example is multi-digit addition: per-digit accuracy climbs gradually, but exact-match on the whole number stays near zero until nearly every digit is right, then jumps.[9] The debate isn't fully settled, since a few tasks show sharp jumps even under continuous metrics, but the field now treats “emergence” as a claim to scrutinize, not a given.[8]

[Figure 4 plot: the same model family under two different metrics, score against scale. Exact-match looks “sudden”; a continuous metric looks smooth and predictable.]
FIG. 4 — Real ability or measurement artifact? The identical underlying progress can look like a cliff under a pass/fail metric and a gentle slope under a continuous one. The metric, not just the model, shapes the story. Schematic.
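The multi-digit addition example can be reproduced with a toy model. The sketch below assumes a per-digit loss that falls as a power law in model size, and it assumes digits are predicted independently, so exact-match accuracy on a k-digit answer is the per-digit accuracy raised to the power k. The constants (n_c, alpha, the 10-digit answer, the model sizes) are hypothetical, chosen only to make the effect visible.

```python
import math

def per_digit_accuracy(n_params, n_c=1e8, alpha=0.5):
    """Toy model: per-digit loss falls smoothly as a power law in model size."""
    loss = (n_c / n_params) ** alpha
    return math.exp(-loss)  # probability of getting one digit right

def exact_match_accuracy(n_params, n_digits=10):
    """All digits must be right; assumes digits are independent."""
    return per_digit_accuracy(n_params) ** n_digits

model_sizes = [1e7, 1e8, 1e9, 1e10, 1e11]  # hypothetical parameter counts
rows = [(n, per_digit_accuracy(n), exact_match_accuracy(n)) for n in model_sizes]

for n, per_digit, exact in rows:
    print(f"{n:.0e} params   per-digit = {per_digit:.3f}   exact-match = {exact:.3f}")
```

Both columns come from the same underlying curve. The per-digit score rises steadily across the whole range, while the exact-match score stays near zero for the smaller models and then climbs quickly, which is the pattern Figure 4 sketches.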
03

How It All Fits Together

Scaling laws turn pretraining from a gamble into a budgeting problem: pick a compute budget, split it between parameters and tokens in the right ratio, and the loss you'll reach is largely foreseeable.

[Figure 5 flow: Budget (Compute: FLOPs you can afford) → Scaling law (Allocate N : D, ≈ 1 : 20, i.e. 20 tokens per parameter) → Run (Pretrain, Foundation 03) → Result (Loss ↓, predictable, plus capabilities)]
FIG. 5 — Scaling as budgeting. The scaling law sits on top of pretraining, deciding how to spend compute; the resulting loss is forecastable, and capabilities follow.

The throughline: more compute, allocated well, reliably buys lower loss — that predictability is what lets labs keep pushing the frontier. The emergence question feeds straight into evaluation: if the choice of metric can manufacture a “sudden ability,” then how you measure decides what you see — the same lesson as the golden-dataset note one tier up. And the compute-optimal insight that a smaller, data-rich model can win is part of what makes efficient inference (Foundation 05) economically possible, since that smaller model is cheaper to serve.

Scaling laws describe loss, not usefulness. They are empirical regularities, not guarantees — a lower loss does not automatically mean a more truthful, aligned, or safe model, and the curves can bend when high-quality data runs short. Predictable is not the same as unlimited.

“Emergence” is a contested word. Depending on the metric you pick, the same smooth underlying progress can look like a sudden leap or a gentle slope. Treat dramatic “ability switched on at size X” plots with care, and always ask what was being measured.

References

  1. Simulations4All. LLM Scaling Laws Visualizer (Kaplan & Chinchilla). https://simulations4all.com/simulations/llm-scaling-laws-visualizer
  2. arXiv · Pearce & Song. Reconciling Kaplan and Chinchilla Scaling Laws. https://arxiv.org/pdf/2406.12907
  3. Medium · R. Hossam. Chinchilla Scaling Laws for Large Language Models. https://medium.com/@raniahossam/chinchilla-scaling-laws-for-large-language-models-llms-40c434e4e1c1
  4. EmergentMind. Chinchilla Scaling Law Overview. https://www.emergentmind.com/topics/chinchilla-scaling-law
  5. M. Brenndoerfer. Chinchilla Scaling Laws: Compute-Optimal LLM Training. https://mbrenndoerfer.com/writing/chinchilla-scaling-laws-compute-optimal-llm-training
  6. arXiv · Schaeffer et al. Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/abs/2304.15004
  7. arXiv · survey. Emergent Abilities in Large Language Models: A Survey (Wei et al. discussion). https://arxiv.org/html/2503.05788v1
  8. arXiv · survey. Emergent Abilities Survey: smooth loss vs sharp accuracy, remaining exceptions. https://arxiv.org/html/2503.05788v1
  9. Dhiria. Emergent abilities in LLMs: reality or mirage? https://www.dhiria.com/en/blog/emergent-abilities-in-large-language-models-reality-or-mirage