Sampling & Decoding: How a Trained Model Turns Probabilities Into Words

TL;DR — Scope

This note covers what happens at generation time: how a finished model turns its next-token probabilities into actual text. The architecture (Foundation 02) and training (Foundation 03) give you a model that outputs a probability distribution over the vocabulary at each step; decoding is the rule for choosing from it, one token at a time, in an autoregressive loop. It spans greedy decoding, beam search, sampling, and the three knobs that shape sampling — temperature, top-k, and top-p — plus the coherence-versus-creativity tradeoff. It excludes the engineering of doing this fast and cheaply (the Inference & Serving discipline) and the limits set by context length (Foundation 06).

The Problem

After pretraining, the model does exactly one thing: given some text, it outputs a probability for every possible next token.^[1] But a probability distribution is not text. Something has to convert “mat: 0.62, rug: 0.18, floor: 0.09, …” into one chosen word — and then do it again, and again, feeding each choice back in.

That choosing rule is decoding, and it matters enormously: the same model, on the same prompt, can sound precise and robotic or fluent and inventive depending only on how you pick.^[2] Always grab the single most likely token and you get safe but repetitive text; introduce randomness and you get variety but risk nonsense. Decoding is how you steer between those.

FIG. 1 — The autoregressive loop. The model proposes, the decoder disposes, the choice is fed back, and the cycle repeats. Every pick conditions every later one — so a single odd choice can derail a whole paragraph.

The Concepts

The Autoregressive Loop

Generation is a loop. The model emits a distribution over the next token; the decoder picks one; that token is appended to the input; the whole thing is fed back to predict the next — and so on until a stop signal.^[1] Every choice conditions all the choices after it, which is why one stray pick can throw off everything downstream.
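The loop itself is small enough to sketch. In the minimal example below, `TOY_MODEL`, `next_token_distribution`, `generate`, and `pick_most_likely` are all hypothetical names made up for illustration: the "model" is a fixed lookup table keyed on the last token, standing in for a real network, and the chooser is just a placeholder for whichever decoding rule is plugged in.

```python
# Hypothetical stand-in for a trained model: maps the last token to a
# probability for each candidate next token. A real model would compute
# this from its weights and the whole prefix.
TOY_MODEL = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"on": 0.9, "<eos>": 0.1},
    "on": {"the": 1.0},
    "mat": {"<eos>": 1.0},
}

def next_token_distribution(tokens):
    return TOY_MODEL[tokens[-1]]

def generate(prompt, choose, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)  # the model proposes
        token = choose(dist)                    # the decoder picks
        if token == "<eos>":                    # stop signal
            break
        tokens.append(token)                    # the choice is fed back in
    return tokens

def pick_most_likely(dist):
    # Placeholder chooser; any decoding rule with this shape could be swapped in.
    return max(dist, key=dist.get)

print(generate(["the"], pick_most_likely))
```

With this particular table and chooser the output cycles through "the cat sat on" until the token budget runs out, a small illustration of how each pick constrains every later one.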

Greedy Decoding

The simplest rule: always take the highest-probability token.^[5] It is deterministic and fast, but myopic — the locally best token can lead into a globally worse, repetitive sequence, because it never looks ahead.^[4]
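A toy case makes the myopia concrete. The sketch below assumes a made-up two-step table, `TOY`, that gives the next-token distribution for each prefix; the table, the token names, and `greedy_decode` are illustrative inventions, not part of any real model or library.

```python
# Hypothetical two-step toy model: next-token distribution for each prefix.
TOY = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}

def greedy_decode(table, steps):
    prefix, prob = (), 1.0
    for _ in range(steps):
        dist = table[prefix]
        token = max(dist, key=dist.get)  # highest-probability token; ties go to the first listed
        prob *= dist[token]              # running probability of the whole sequence
        prefix += (token,)
    return prefix, prob

seq, p = greedy_decode(TOY, steps=2)
print(seq, round(p, 2))  # ('A', 'x') 0.3
```

Greedy commits to "A" because 0.6 beats 0.4, and ends with a sequence of probability 0.30. The sequence ("B", "z") has probability 0.36, but greedy never sees it because it discarded "B" at the first step.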

Beam Search

Instead of committing to one token, keep the top-N partial sequences (“beams”) alive at each step, extend them all, and choose the best overall at the end.^[7] The output is more fluent than greedy, but it is computationally expensive and tends toward generic, repetitive text — which is why it appears in machine translation far more than in chatbots.^[2]
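As a rough sketch of the idea, the code below runs beam search over a made-up two-step table. `TOY` and `beam_search` are hypothetical, and the sketch leaves out things a practical implementation would likely need, such as end-of-sequence handling and length normalization.

```python
import math

# Hypothetical two-step toy model: next-token distribution for each prefix.
TOY = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}

def beam_search(table, steps, beam_width):
    beams = [((), 0.0)]  # (prefix, log-probability of that prefix)
    for _ in range(steps):
        candidates = []
        for prefix, logp in beams:
            # Extend every live beam by every possible next token.
            for token, p in table[prefix].items():
                candidates.append((prefix + (token,), logp + math.log(p)))
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # best complete sequence and its log-probability

seq, logp = beam_search(TOY, steps=2, beam_width=2)
print(seq, round(math.exp(logp), 2))  # ('B', 'z') 0.36
```

With a beam width of 2, the less likely first token "B" survives long enough for its strong continuation "z" to win. Setting `beam_width=1` reduces this sketch to greedy decoding, which returns ("A", "x") on the same table.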

FIG. 2 — Greedy vs beam. Greedy locks onto the single most-probable token at every step; beam search keeps several candidate sequences in play and commits only at the end — more fluent, but heavier and often blander. Illustrative.

Sampling & Temperature

Rather than always taking the top token, sampling randomly draws from the distribution: high-probability tokens are likely, but the tail still gets a chance, which produces diverse, creative text.^[2] Temperature reshapes the distribution before drawing by dividing the model's raw scores (logits) by a value T ahead of the softmax: a low temperature (T < 1) sharpens it toward the favorites (more deterministic), a high temperature (T > 1) flattens it toward the long tail (more random), and in the limit temperature 0 collapses back to greedy.^[3]^[2] Almost every modern LLM API generates by sampling rather than greedy or beam search.^[2]
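One common way to apply temperature is to divide the raw scores by T before the softmax, so that p_i = exp(z_i / T) / sum_j exp(z_j / T). The sketch below assumes that formulation; the function name and the five logits are made up for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy (argmax)")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0, -1.0]       # made-up scores for five candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])

# Sampling is then a weighted draw from the reshaped distribution.
rng = random.Random(0)                    # seeded only so the draw is repeatable
probs = softmax_with_temperature(logits, 1.0)
sampled_index = rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Printing the three rows side by side shows the top token's share shrinking as the temperature rises, while the tail tokens gain.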

FIG. 3 — Temperature. The same five candidate tokens, three temperatures. Low temperature concentrates probability on the favorites; high temperature spreads it toward the tail, trading reliability for diversity. Schematic.

Top-k & Top-p (Nucleus)

Pure sampling can occasionally pick an absurd tail token, so the tail is usually truncated first.^[8] Top-k keeps the k most likely tokens (a fixed-size set), then renormalizes and samples.^[8] Top-p, or nucleus sampling, instead keeps the smallest set of tokens whose cumulative probability reaches a threshold p, so the set is dynamic: small when the model is confident and peaked, larger when it is unsure.^[6]^[5] It was introduced to address the degenerate text that earlier methods produced: repetitive output from likelihood-maximizing decoding and incoherent output from unrestricted sampling.^[6]
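The two truncation rules can be sketched side by side. The function names and both example distributions below are assumptions made for illustration; real libraries typically operate on logits or tensors rather than dictionaries.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p, then renormalize."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Made-up distributions: one where the model is confident, one where it is unsure.
peaked = {"mat": 0.90, "rug": 0.05, "floor": 0.03, "sofa": 0.02}
flat = {"mat": 0.30, "rug": 0.28, "floor": 0.22, "sofa": 0.20}

print(len(top_k_filter(peaked, 2)), len(top_k_filter(flat, 2)))        # 2 2
print(len(top_p_filter(peaked, 0.75)), len(top_p_filter(flat, 0.75)))  # 1 3
```

Top-k keeps two tokens in both cases, whereas top-p keeps one token for the peaked distribution and three for the flat one, which is the dynamic behavior described above.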

FIG. 4 — Two ways to trim the tail. On the sorted distribution, top-k keeps a fixed number of tokens; top-p keeps however many are needed to reach a cumulative probability — fewer when the model is sure, more when it isn't. Illustrative.

The Tradeoff & Penalties

The whole space is one tradeoff: deterministic methods (greedy, beam) are coherent but bland; stochastic methods (sampling) are diverse but riskier; truncation (top-k/top-p) trims the unreliable tail to capture much of both.^[8] A common add-on is a repetition or n-gram penalty, which lowers the probability of tokens that would repeat text already generated.^[5]
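Penalties come in several variants, and libraries differ in the details. The sketch below shows one deliberately simple form, a fixed amount subtracted from the logit of any token that has already been generated. The function name, the logits, and the penalty value are all assumptions made for illustration.

```python
def apply_repetition_penalty(logits, generated, penalty):
    """Subtract `penalty` from the logit of every token already present in `generated`."""
    seen = set(generated)
    return {tok: (z - penalty if tok in seen else z) for tok, z in logits.items()}

logits = {"cat": 2.0, "dog": 1.8, "sat": 0.5}   # made-up raw scores
adjusted = apply_repetition_penalty(logits, generated=["the", "cat"], penalty=1.0)

print(max(logits, key=logits.get))      # cat
print(max(adjusted, key=adjusted.get))  # dog
```

Because "cat" has already appeared, its score drops below that of "dog", so even a greedy pick moves on to a new token. The adjusted logits would then pass through temperature, softmax, and truncation as usual.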

How It All Fits Together

At each step the model's raw scores (logits) are reshaped by temperature, turned into probabilities, optionally truncated by top-k or top-p and renormalized, and then one token is drawn. That token is appended to the input and the loop runs again.

Model: Logits (raw scores) → Knob: Temperature (sharpen / flatten) → Norm: Softmax (→ probabilities) → Knob: Top-k / Top-p (trim the tail) → Choose: Sample (draw a token) → Loop: Token (append & repeat)

FIG. 5 — The per-step decoding pipeline. Temperature reshapes, softmax normalizes, top-k/top-p truncates, and sampling makes the actual choice — then the token feeds back into the loop of Fig. 1.
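The per-step pipeline can be sketched as a single function. Everything named here (`sample_next`, its parameters, the example logits) is hypothetical, and the ordering of truncation steps and the renormalization between them are assumptions of this sketch; real serving stacks may order or combine them differently.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """One decoding step: temperature, softmax, top-k / top-p truncation, then a draw."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:                      # convention: temperature 0 means greedy
        return max(logits, key=logits.get)

    # 1. Temperature and softmax (shifted by the max for numerical stability).
    m = max(logits.values())
    weights = {t: math.exp((z - m) / temperature) for t, z in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((t, w / total) for t, w in weights.items()),
                    key=lambda tw: tw[1], reverse=True)

    # 2. Top-k: keep a fixed number of tokens, then renormalize.
    if top_k is not None:
        ranked = ranked[:top_k]
        s = sum(p for _, p in ranked)
        ranked = [(t, p / s) for t, p in ranked]

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    # 4. Draw one token; random.choices treats the weights as relative,
    #    so no further renormalization is needed.
    tokens, probs = zip(*ranked)
    return rng.choices(tokens, weights=probs, k=1)[0]

logits = {"mat": 3.0, "rug": 1.5, "floor": 1.0, "banana": -2.0}  # made-up scores
print(sample_next(logits, temperature=0))                        # always "mat"
print(sample_next(logits, temperature=0.8, top_k=3, top_p=0.9))  # varies from run to run
```

The returned token would be appended to the input and the function called again on the next step's logits, closing the loop of Fig. 1.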

The throughline: decoding is the bridge between the trained model and the text you actually read. It runs the very same next-token machinery as pretraining (Foundation 03), only with the weights frozen and the output sampled instead of scored against a target. The engineering of running this loop quickly — the KV cache, batching, speculative decoding — is the Inference & Serving discipline one tier up; how long the loop can run is bounded by the context window (Foundation 06); and these temperature and top-p settings are exactly the knobs an evaluation harness must pin down, because a model's measured quality depends on how it is sampled.

① The settings are part of the result. Temperature and top-p are not cosmetic: they change behavior enough that the same model benchmarked at temperature 0 and at 0.8 can look like two different systems. Always report sampling settings, and fix them when comparing models.

② Randomness defeats reproducibility. With sampling on and no fixed random seed, the same prompt can yield a different answer on every run, which is wonderful for creativity but painful for debugging, testing, or anything that needs a stable output. For those, temperature 0 (greedy) is the honest default.
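In a small script, the two usual routes to a stable output are a fixed seed or a greedy pick. The sketch below uses Python's standard `random` module with made-up tokens and weights; whether a hosted model API exposes a seed at all, and how strictly it honors one, is an assumption to check per provider.

```python
import random

tokens = ["mat", "rug", "floor"]
weights = [0.62, 0.18, 0.09]   # relative weights; random.choices does not need them to sum to 1

def draw_sequence(seed, n=5):
    rng = random.Random(seed)  # a private generator, so other code cannot disturb it
    return rng.choices(tokens, weights=weights, k=n)

# Same seed, same draws: sampled output becomes repeatable within this script.
print(draw_sequence(seed=1234) == draw_sequence(seed=1234))  # True

# Greedy needs no seed at all: the pick depends only on the distribution.
greedy = max(zip(weights, tokens))[1]
print(greedy)  # mat
```

A fixed seed keeps the variety of sampling while making a run replayable, whereas the greedy pick removes the randomness altogether.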
