This note covers what happens at generation time: how a finished model turns its next-token probabilities into actual text. The architecture (Foundation 02) and training (Foundation 03) give you a model that outputs a probability distribution over the vocabulary at each step; decoding is the rule for choosing from it, one token at a time, in an autoregressive loop. The note spans greedy decoding, beam search, sampling, and the three knobs that shape sampling (temperature, top-k, and top-p), plus the coherence-versus-creativity tradeoff. It excludes the engineering of doing this fast and cheaply (the Inference & Serving discipline) and the limits set by context length (Foundation 06).
After pretraining, the model does exactly one thing: given some text, it outputs a probability for every possible next token.[1] But a probability distribution is not text. Something has to convert “mat: 0.62, rug: 0.18, floor: 0.09, …” into one chosen word — and then do it again, and again, feeding each choice back in.
That choosing rule is decoding, and it matters enormously: the same model, on the same prompt, can sound precise and robotic or fluent and inventive depending only on how you pick.[2] Always grab the single most likely token and you get safe but repetitive text; introduce randomness and you get variety but risk nonsense. Decoding is how you steer between those.
Generation is a loop. The model emits a distribution over the next token; the decoder picks one; that token is appended to the input; the whole thing is fed back to predict the next — and so on until a stop signal.[1] Every choice conditions all the choices after it, which is why one stray pick can throw off everything downstream.
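The loop can be sketched in a few lines of Python. This is a minimal illustration, not a real inference stack: `toy_model` is a hypothetical lookup table standing in for a trained network, and `generate`, `most_likely`, and the `<eos>` stop token are names assumed here for the example.

```python
from typing import Callable, Dict, List

Distribution = Dict[str, float]

def toy_model(tokens: List[str]) -> Distribution:
    """Hypothetical stand-in for a trained model: next-token probabilities
    that depend only on the last token (a real model sees the whole input)."""
    table = {
        "the": {"cat": 0.6, "dog": 0.4},
        "cat": {"sat": 0.7, "ran": 0.3},
        "dog": {"ran": 0.8, "sat": 0.2},
        "sat": {"<eos>": 1.0},
        "ran": {"<eos>": 1.0},
    }
    return table[tokens[-1]]

def generate(
    model: Callable[[List[str]], Distribution],
    pick: Callable[[Distribution], str],
    prompt: List[str],
    max_new_tokens: int = 10,
    stop: str = "<eos>",
) -> List[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = model(tokens)   # 1. the model emits a distribution
        token = pick(dist)     # 2. the decoder picks one token
        if token == stop:      # stop signal ends the loop
            break
        tokens.append(token)   # 3. the pick is fed back as input
    return tokens

def most_likely(dist: Distribution) -> str:
    """One possible picking rule: take the highest-probability token."""
    return max(dist, key=dist.get)

print(generate(toy_model, most_likely, ["the"]))  # ['the', 'cat', 'sat']
```

Swapping `pick` for a different rule changes the output without touching the model, which is the sense in which decoding is separate from training.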
The simplest rule: always take the highest-probability token.[5] It is deterministic and fast, but myopic — the locally best token can lead into a globally worse, repetitive sequence, because it never looks ahead.[4]
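A small worked example makes the myopia concrete. The two-step table `TOY` below is made up for illustration; its numbers are chosen so that the locally best first token leads to a less probable sequence overall.

```python
from typing import Dict, Tuple

# Hypothetical toy "model": prefix (as a tuple) -> next-token probabilities.
TOY: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"z": 0.9, "w": 0.1},
}

def greedy_decode(table, steps: int):
    """Take the highest-probability token at every step, never looking ahead."""
    prefix: Tuple[str, ...] = ()
    prob = 1.0
    for _ in range(steps):
        dist = table[prefix]
        token = max(dist, key=dist.get)
        prob *= dist[token]
        prefix += (token,)
    return prefix, prob

seq, p = greedy_decode(TOY, steps=2)
print(seq, round(p, 2))             # ('A', 'x') 0.33
print(("B", "z"), round(0.4 * 0.9, 2))  # ('B', 'z') 0.36, the sequence greedy missed
```

Greedy commits to "A" because 0.6 beats 0.4, and never discovers that "B" opens onto a much more confident continuation.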
Instead of committing to one token, keep the top-N partial sequences (“beams”) alive at each step, extend them all, and choose the highest-scoring sequence at the end.[7] The output is more fluent than greedy decoding, but beam search is computationally expensive and tends toward generic, repetitive text, which is why it appears in machine translation far more than in chatbots.[2]
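Beam search can be sketched as follows. This is a simplified version under stated assumptions: sequences have a fixed length (no end-of-sequence handling or length normalization), scores are summed log-probabilities, and `TOY` is a made-up two-step table in which the best sequence does not start with the most likely first token.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy "model": prefix (as a tuple) -> next-token probabilities.
TOY: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"z": 0.9, "w": 0.1},
}

def beam_search(table, steps: int, beam_width: int):
    """Keep the `beam_width` best partial sequences alive at each step."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]  # (prefix, log-prob)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in table[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Extend every beam, then keep only the highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # best overall sequence and its log-probability

best, score = beam_search(TOY, steps=2, beam_width=2)
print(best, round(math.exp(score), 2))  # ('B', 'z') 0.36

narrow, score1 = beam_search(TOY, steps=2, beam_width=1)
print(narrow, round(math.exp(score1), 2))  # ('A', 'x') 0.33
```

With a beam width of 1 the procedure reduces to greedy decoding; the extra cost of wider beams comes from scoring every extension of every beam at every step.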
Rather than always taking the top token, sampling randomly draws from the distribution: high-probability tokens are likely, but the tail still gets a chance, which produces diverse, creative text.[2] Temperature reshapes the distribution before drawing: the logits are divided by the temperature before being turned into probabilities, so a low temperature sharpens the distribution toward the favorites (more deterministic), a high temperature flattens it toward the long tail (more random), and in the limit of temperature 0 sampling collapses back to greedy.[3][2] Almost every modern LLM API generates by sampling rather than by greedy or beam search.[2]
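The effect of temperature is easy to see numerically. The sketch below assumes the standard formulation, in which logits are divided by the temperature before the softmax; the logits and token names are made up for illustration.

```python
import math
import random
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy instead")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["mat", "rug", "floor"]  # illustrative tokens
logits = [2.0, 1.0, 0.0]          # illustrative raw scores

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])
# roughly: 0.5 -> [0.87, 0.12, 0.02]  (sharper)
#          1.0 -> [0.67, 0.24, 0.09]
#          2.0 -> [0.51, 0.31, 0.19]  (flatter)

# Sampling is then a weighted random draw from the reshaped distribution.
probs = softmax_with_temperature(logits, 1.0)
print(random.choices(tokens, weights=probs, k=1)[0])
```

The ranking of tokens never changes with temperature; only how much probability the favorites hold relative to the tail.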
Pure sampling can occasionally pick an absurd tail token, so the tail is usually truncated first.[8] Top-k keeps the k most likely tokens, a fixed-size set, then renormalizes and samples.[8] Top-p, or nucleus sampling, instead keeps the smallest set of tokens whose cumulative probability reaches a threshold p, so the set is dynamic: small when the model is confident and peaked, larger when it is unsure.[6][5] Nucleus sampling was introduced specifically to cure the repetitive, nonsensical text that other methods produced.[6]
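The difference between a fixed-size and a dynamic set shows up clearly on two made-up distributions, one peaked and one flat. The helper names `top_k_filter` and `top_p_filter` are assumptions for this sketch, and the cutoff here includes the token that first brings the cumulative probability up to p.

```python
from typing import Dict

def top_k_filter(dist: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most likely tokens, then renormalize."""
    kept = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(dist: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set whose cumulative probability reaches p, then renormalize."""
    kept = []
    cumulative = 0.0
    for tok, prob in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Illustrative distributions: a confident model and an unsure one.
peaked = {"mat": 0.90, "rug": 0.05, "floor": 0.03, "sofa": 0.02}
flat = {"mat": 0.30, "rug": 0.28, "floor": 0.22, "sofa": 0.20}

print(len(top_k_filter(peaked, 2)), len(top_k_filter(flat, 2)))        # 2 2
print(len(top_p_filter(peaked, 0.75)), len(top_p_filter(flat, 0.75)))  # 1 3
```

Top-k keeps two tokens in both cases, even though the second token in the peaked case is nearly noise; top-p keeps one token when the model is confident and three when it is not.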
The whole space is one tradeoff: deterministic methods (greedy, beam) are coherent but bland; stochastic methods (sampling) are diverse but riskier; truncation (top-k/top-p) trims the unreliable tail to capture much of both.[8] A common add-on is a repetition or n-gram penalty, which lowers the probability of tokens that would repeat text already generated.[5]
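A repetition penalty can be sketched as an adjustment to the logits before sampling. The variant below is one common formulation, assumed here rather than taken from this note: scores of already-generated tokens are divided by the penalty when positive and multiplied by it when negative, so they always move downward. The function name, the default of 1.2, and the logits are illustrative.

```python
from typing import Dict, List

def apply_repetition_penalty(
    logits: Dict[str, float],
    generated: List[str],
    penalty: float = 1.2,  # illustrative value; values above 1 discourage repeats
) -> Dict[str, float]:
    """Lower the score of every token that already appears in the output."""
    adjusted = dict(logits)
    for tok in set(generated):
        if tok in adjusted:
            score = adjusted[tok]
            # Dividing a positive score or multiplying a negative one both
            # push the score down, making the token less likely.
            adjusted[tok] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = {"the": 3.0, "cat": 1.5, "sat": -0.5}
penalized = apply_repetition_penalty(logits, generated=["the", "sat"])
print(penalized)  # "the" and "sat" drop; "cat" is unchanged
```

An n-gram penalty works on the same principle but blocks or down-weights tokens that would complete an already-seen sequence of n tokens rather than a single repeated token.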
At each step the model's raw scores (logits) are reshaped by temperature, turned into probabilities, and optionally truncated by top-k or top-p; then one token is drawn and appended, and the loop runs again.
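Put together, a single decoding step might look like the sketch below. It is one plausible arrangement, not a reference implementation: the function name `sample_next` and its parameters are assumptions, and real libraries differ in the order of the truncation steps and in whether they renormalize between them.

```python
import math
import random
from typing import Dict, Optional

def sample_next(
    logits: Dict[str, float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    rng=random,
) -> str:
    """One decoding step: temperature -> softmax -> top-k -> top-p -> draw."""
    if temperature == 0:
        return max(logits, key=logits.get)  # temperature 0 is treated as greedy

    # 1. Reshape the raw scores with temperature.
    scaled = {t: s / temperature for t, s in logits.items()}

    # 2. Turn scores into probabilities (numerically stable softmax).
    m = max(scaled.values())
    exps = {t: math.exp(s - m) for t, s in scaled.items()}
    z = sum(exps.values())
    ranked = sorted(((t, e / z) for t, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)

    # 3. Optional top-k truncation: keep a fixed number of tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # 4. Optional top-p truncation: keep the smallest set reaching top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # 5. Draw one token; random.choices accepts unnormalized weights,
    #    so the surviving probabilities are renormalized implicitly.
    tokens = [tok for tok, _ in ranked]
    weights = [p for _, p in ranked]
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"mat": 2.0, "rug": 1.0, "floor": 0.0, "sofa": -1.0}  # illustrative
print(sample_next(logits, temperature=0.8, top_k=3, top_p=0.9))
```

The returned token would then be appended to the input and the step repeated until a stop signal.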
The throughline: decoding is the bridge between the trained model and the text you actually read. It runs the very same next-token machinery as pretraining (Foundation 03), only with the weights frozen and the output sampled instead of scored against a target. The engineering of running this loop quickly — the KV cache, batching, speculative decoding — is the Inference & Serving discipline one tier up; how long the loop can run is bounded by the context window (Foundation 06); and these temperature and top-p settings are exactly the knobs an evaluation harness must pin down, because a model's measured quality depends on how it is sampled.
The settings are part of the result. Temperature and top-p are not cosmetic — they change behavior enough that the same model benchmarked at temperature 0 and at 0.8 can look like two different systems. Always report sampling settings, and fix them when comparing models.
Randomness defeats reproducibility. With sampling on, the same prompt can yield a different answer on every run, which is wonderful for creativity but painful for debugging, testing, or anything that needs a stable output. For those, temperature 0 (greedy) is the honest default.
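The contrast can be demonstrated with a toy draw. The sketch below reuses the example probabilities for "mat", "rug", and "floor" from earlier in this note (the rest of the vocabulary is omitted) and also shows a second route to repeatable output, a fixed random seed. Whether a seed gives identical output in a real serving stack is an assumption that should be checked per system.

```python
import random
from typing import List

tokens = ["mat", "rug", "floor"]
probs = [0.62, 0.18, 0.09]  # unnormalized is fine for random.choices

def sampled_run(seed: int, n: int = 5) -> List[str]:
    """Draw n tokens with a private, seeded random generator."""
    rng = random.Random(seed)
    return [rng.choices(tokens, weights=probs, k=1)[0] for _ in range(n)]

def greedy_run(n: int = 5) -> List[str]:
    """Temperature-0 behavior: always the single most likely token."""
    best = tokens[probs.index(max(probs))]
    return [best] * n

print(greedy_run())       # ['mat', 'mat', 'mat', 'mat', 'mat'] on every run
print(sampled_run(42))    # some mix of tokens
print(sampled_run(42) == sampled_run(42))  # True: same seed, same draws
```

Greedy output is stable by construction; seeded sampling is stable only as long as the seed, the code path, and the underlying random generator stay the same.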