This note covers how a randomly initialized transformer becomes a capable model. The training objective is deceptively simple: predict the next token, over and over, across web-scale text. The note spans the self-supervised objective, the cross-entropy loss that scores predictions, the gradient-descent loop that improves them, the data pipeline that feeds it, what the model actually learns, and the split between pretraining and fine-tuning (a base model vs an assistant). It excludes the architecture itself (Foundation 02) and how a finished model generates text at inference (Foundation 05).
After Foundation 02 we have a transformer, but its weights are random — it knows nothing. We need it to learn language: grammar, meaning, facts, even reasoning. The obvious approach, supervised learning, needs labeled examples — yet no one can hand-label a meaningful fraction of human text.
The breakthrough is to pick a task where the text labels itself: predict the next token. Given a stretch of text, the correct answer at every position is simply the token that actually comes next.[1] That turns ordinary writing into a supervised problem with free labels — which is why pretraining is called self-supervised — and makes essentially the entire internet usable as training data without a single human annotation.[1]
The objective is to predict the token at position t+1 from the tokens up to position t; across a whole sequence, the targets are simply the inputs shifted by one position, so the target at position t is the input token at position t+1.[1] Run this over enough text and it implicitly teaches sequential reasoning and fluency, because the model has to figure out what plausibly comes next, everywhere, all the time.[2]
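The input/target relationship can be made concrete with a few lines of Python. This is a minimal sketch: the token IDs are made up for illustration (real IDs would come from a tokenizer), and `make_next_token_pairs` is a hypothetical helper name, not part of any library.

```python
# Hypothetical token IDs for one short sequence; real IDs come from a tokenizer.
tokens = [11, 42, 7, 99, 3]

def make_next_token_pairs(tokens):
    """Split one token sequence into model inputs and next-token targets."""
    inputs = tokens[:-1]   # every token except the last
    targets = tokens[1:]   # every token except the first: what follows each input
    return inputs, targets

inputs, targets = make_next_token_pairs(tokens)

# At position t the model may look at tokens[0..t] and must predict tokens[t+1].
for t, target in enumerate(targets):
    print(f"position {t}: context {tokens[:t + 1]} -> target {target}")
```

One sequence of n tokens therefore yields n-1 training examples at once, which is part of why the labels are effectively free.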
At each position the model outputs a probability distribution over the entire vocabulary — its guess for what comes next. Cross-entropy loss measures how surprised the model was by the actual next token: confident and right scores low, confident and wrong scores high.[3] Training is the relentless minimization of that surprise across the corpus.[3]
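For a single position, cross-entropy reduces to the negative log of the probability the model gave to the token that actually came next. The sketch below illustrates that on a toy four-token vocabulary; the probability values and variable names are invented for the example.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the token that actually came next."""
    return -math.log(probs[target_index])

# Toy 4-token vocabulary; assume index 2 is the true next token.
TARGET = 2
confident_right = [0.01, 0.01, 0.97, 0.01]
unsure          = [0.25, 0.25, 0.25, 0.25]
confident_wrong = [0.97, 0.01, 0.01, 0.01]

loss_right = cross_entropy(confident_right, TARGET)
loss_unsure = cross_entropy(unsure, TARGET)
loss_wrong = cross_entropy(confident_wrong, TARGET)

print(f"confident and right: {loss_right:.3f}")
print(f"unsure:              {loss_unsure:.3f}")
print(f"confident and wrong: {loss_wrong:.3f}")
```

A uniform guess over V tokens always costs log(V), which is a useful sanity check: a freshly initialized model should start near that value.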
Each batch of text produces a loss. Backpropagation computes, for every one of the model's billions of parameters, the direction that would reduce that loss; an optimizer (Adam or AdamW) then nudges them all a tiny step that way.[3] Repeat across billions of examples and the loss steadily falls, sharpening the model's predictions a fraction at a time.[4]
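The loop can be sketched at toy scale. The example below is not a transformer and does not use Adam: as a loudly stated simplification, it fits a table of next-token logits (a bigram model) with plain gradient descent, using the known closed-form gradient of cross-entropy with respect to the logits in place of backpropagation. The names `train_bigram`, `steps`, and `lr` are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(tokens, vocab_size, steps=200, lr=0.5):
    """Fit next-token logits with plain gradient descent (stand-in for Adam/AdamW)."""
    # W[a][b] is the logit for token b following token a; zeros give a uniform guess.
    W = [[0.0] * vocab_size for _ in range(vocab_size)]
    pairs = list(zip(tokens[:-1], tokens[1:]))  # (current token, next token)
    n = len(pairs)
    history = []
    for _ in range(steps):
        grad = [[0.0] * vocab_size for _ in range(vocab_size)]
        loss = 0.0
        for prev, nxt in pairs:
            probs = softmax(W[prev])
            loss -= math.log(probs[nxt])  # cross-entropy for this position
            # Gradient of cross-entropy w.r.t. the logits: probs minus one-hot target.
            for b in range(vocab_size):
                grad[prev][b] += probs[b] - (1.0 if b == nxt else 0.0)
        history.append(loss / n)  # mean loss before this step's update
        # Nudge every parameter a small step against the gradient of the mean loss.
        for a in range(vocab_size):
            for b in range(vocab_size):
                W[a][b] -= lr * grad[a][b] / n
    return W, history

# Toy corpus with a fully predictable pattern: 0 -> 1 -> 2 -> 0 -> ...
tokens = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
W, history = train_bigram(tokens, vocab_size=3)
print(f"mean loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

The shape is the same as in real pretraining: forward pass, loss, gradient, small update, repeat. What changes at scale is the model, the optimizer, and the fact that each step sees a fresh batch rather than the whole corpus.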
This is the surprising part. To predict the next token well across the whole internet, the model is forced to internalize grammar, word meanings, and a great deal of world knowledge.[4] And capabilities nobody trained for emerge as side effects — including in-context learning, the ability to pick up a new task from examples in the prompt, which arises directly from ordinary next-token pretraining rather than being bolted on.[9]
The fuel is web-scale raw text such as Common Crawl — but raw web data is riddled with errors, gibberish, and duplicates that would mislead the model and waste compute, so it is aggressively cleaned, filtered, and deduplicated.[5] The attrition is dramatic: a pipeline might begin near 100 PB of raw crawl and end around 1 PB of usable corpus — on the order of 1% surviving.[6] Deduplication in particular is mission-critical at trillion-token scale, done by exact hashing, embedding similarity, or approximate methods like MinHash.[7]
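Two of the deduplication methods named above, exact hashing and MinHash, can be sketched with the standard library alone. This is an illustrative toy, not a production pipeline: the whitespace-and-lowercase normalization, the three-word shingles, the 64 hash functions, and all function names are assumptions chosen for the example.

```python
import hashlib

def normalize(text):
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first copy of each document, comparing by content hash."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def shingles(text, k=3):
    """Set of overlapping k-word windows from the normalized text."""
    words = normalize(text).split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the smallest hash over all shingles."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching slots; approximates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = [
    "The quick brown fox jumps over the lazy dog near the river bank",
    "the quick  brown fox jumps over the lazy dog near the river bank",   # exact dup after normalization
    "The quick brown fox jumps over the lazy dog near the river shore",  # near duplicate
    "Gradient descent updates parameters using the slope of the loss surface",
]

unique = exact_dedup(docs)
sigs = [minhash_signature(d) for d in unique]
sim_near = estimated_similarity(sigs[0], sigs[1])
sim_far = estimated_similarity(sigs[0], sigs[2])
print(len(unique), "documents survive exact dedup")
print("near-duplicate similarity:", sim_near)
print("unrelated similarity:", sim_far)
```

Exact hashing only catches copies that are identical after normalization, while MinHash signatures flag documents that share most of their content. Comparing every pair of signatures, as done here, would not scale to a web crawl; large pipelines typically bucket signatures with locality-sensitive hashing so that only likely matches are compared.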
Pretraining is the giant, general, astronomically expensive phase — the most compute-intensive part of building an LLM, often costing tens to hundreds of millions of dollars.[8] What it produces is a base model that simply continues text. Turning that into a helpful assistant is a separate, far smaller adaptation phase on targeted data[5] — the territory of the adaptation discipline one tier up.
End to end, pretraining is one loop run at enormous scale: feed text in, predict the next token, score the surprise, push the weights toward less surprise, repeat — until a base model emerges, which adaptation later turns into an assistant.
The throughline: a single, almost trivial objective (guess the next token), applied to enough text with enough compute, is what turns random weights into a model that knows grammar, facts, and more. This is the phase that creates the weights every other discipline depends on. How much data and compute it needs, and how many parameters, is governed by scaling laws (Foundation 04). The base-model-to-assistant handoff at the end is precisely where the adaptation discipline takes over. And next-token prediction is not only how the model is trained; it is literally what the model does at generation time (Foundation 05), just with the weights frozen and no loss computed.
The model only ever learns to predict text. It is not optimizing for truth or helpfulness, only for statistical likelihood — so a fluent falsehood is, to the loss function, an excellent prediction. That is the structural root of hallucination, and the reason alignment is a separate problem rather than something pretraining solves on its own.
Garbage in, garbage out — at planetary scale. The model absorbs whatever the data contains, including its biases, errors, and imbalances, which is why the unglamorous cleaning-and-dedup pipeline matters as much as the architecture. The quality of the data, not just its quantity, sets the ceiling.