Research Note  ·  Foundations / Generation

Sampling & Decoding: How a Trained Model Turns Probabilities Into Words

Foundations · 05 of the LLM Foundations series · Tier 1
TL;DR — Scope

This note covers what happens at generation time: how a finished model turns its next-token probabilities into actual text. The architecture (Foundation 02) and training (Foundation 03) give you a model that outputs a probability distribution over the vocabulary at each step; decoding is the rule for choosing from it, one token at a time, in an autoregressive loop. It spans greedy decoding, beam search, sampling, and the three knobs that shape sampling — temperature, top-k, and top-p — plus the coherence-versus-creativity tradeoff. It excludes the engineering of doing this fast and cheaply (the Inference & Serving discipline) and the limits set by context length (Foundation 06).

01

The Problem

After pretraining, the model does exactly one thing: given some text, it outputs a probability for every possible next token.[1] But a probability distribution is not text. Something has to convert “mat: 0.62, rug: 0.18, floor: 0.09, …” into one chosen word — and then do it again, and again, feeding each choice back in.

That choosing rule is decoding, and it matters enormously: the same model, on the same prompt, can sound precise and robotic or fluent and inventive depending only on how you pick.[2] Always grab the single most likely token and you get safe but repetitive text; introduce randomness and you get variety but risk nonsense. Decoding is how you steer between those.

[Figure 1 diagram: Model (forward pass) → Distribution (probs over vocab) → Pick a token (the decoding rule) → Append (feed back in), repeated until a stop token. E.g. “I have a” → pick “dream” → “I have a dream” → …]
FIG. 1 — The autoregressive loop. The model proposes, the decoder disposes, the choice is fed back, and the cycle repeats. Every pick conditions every later one — so a single odd choice can derail a whole paragraph.

02

The Concepts

The Autoregressive Loop

Generation is a loop. The model emits a distribution over the next token; the decoder picks one; that token is appended to the input; the whole thing is fed back to predict the next — and so on until a stop signal.[1] Every choice conditions all the choices after it, which is why one stray pick can throw off everything downstream.
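
The loop described above can be sketched in a few lines. This is a minimal illustration rather than a real model: TOY_MODEL, next_token_distribution, generate, and the two picking rules are hypothetical names invented here, and the toy table conditions only on the last token, whereas a real LLM conditions on the whole prefix.

```python
import random

# Hypothetical toy "model": maps the last token to a next-token distribution.
# A real LLM conditions on the entire prefix; this table is for illustration only.
TOY_MODEL = {
    "<s>": {"I": 1.0},
    "I": {"have": 0.9, "am": 0.1},
    "have": {"a": 1.0},
    "am": {"a": 1.0},
    "a": {"dream": 0.7, "plan": 0.3},
    "dream": {"</s>": 1.0},
    "plan": {"</s>": 1.0},
}


def next_token_distribution(tokens):
    """Stand-in for a forward pass: return {token: probability}."""
    return TOY_MODEL[tokens[-1]]


def generate(pick, max_steps=10):
    """Run the loop: get a distribution, pick a token, append it, repeat."""
    tokens = ["<s>"]
    for _ in range(max_steps):
        dist = next_token_distribution(tokens)
        token = pick(dist)        # the decoding rule is pluggable
        if token == "</s>":       # stop signal
            break
        tokens.append(token)      # feed the choice back in
    return tokens[1:]             # drop the start marker


def pick_most_likely(dist):
    """Deterministic rule: take the highest-probability token."""
    return max(dist, key=dist.get)


def make_sampler(seed):
    """Stochastic rule: draw a token in proportion to its probability."""
    rng = random.Random(seed)

    def pick(dist):
        return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

    return pick


print(generate(pick_most_likely))  # ['I', 'have', 'a', 'dream']
print(generate(make_sampler(0)))   # varies with the seed
```

Only the pick argument changes between decoding strategies; the loop itself stays the same.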

Greedy Decoding

The simplest rule: always take the highest-probability token.[5] It is deterministic and fast but myopic: because the rule never looks ahead, the locally best token can lead into a globally worse, repetitive sequence.[4]
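
A small worked case makes the myopia concrete. The probabilities below are made up for illustration, and TOY_MODEL, greedy_decode, and sequence_probability are hypothetical helpers, not part of any library. Greedy takes “cat” because it is the better first step, yet the sequence through “dog” has the higher total probability.

```python
# Hypothetical model: next-token probabilities keyed by the prefix so far.
TOY_MODEL = {
    (): {"The": 1.0},
    ("The",): {"cat": 0.6, "dog": 0.4},
    ("The", "cat"): {"sat": 0.55, "ran": 0.45},
    ("The", "dog"): {"sat": 0.1, "ran": 0.9},
}


def greedy_decode(model, steps):
    """Take the argmax at every step and track the sequence probability."""
    tokens, prob = (), 1.0
    for _ in range(steps):
        dist = model[tokens]
        best = max(dist, key=dist.get)  # highest-probability next token
        prob *= dist[best]
        tokens += (best,)
    return tokens, prob


def sequence_probability(model, tokens):
    """Multiply the per-step probabilities of a given sequence."""
    prob = 1.0
    for i, tok in enumerate(tokens):
        prob *= model[tuple(tokens[:i])][tok]
    return prob


greedy_tokens, greedy_prob = greedy_decode(TOY_MODEL, steps=3)
alt_prob = sequence_probability(TOY_MODEL, ("The", "dog", "ran"))

print(greedy_tokens, round(greedy_prob, 2))            # ('The', 'cat', 'sat') 0.33
print(("The", "dog", "ran"), round(alt_prob, 2))       # ('The', 'dog', 'ran') 0.36
```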

Beam Search

Instead of committing to one token, keep the top-N partial sequences (“beams”) alive at each step, extend them all, and choose the highest-scoring complete sequence at the end.[7] The output is often more fluent than greedy decoding's, but the cost grows roughly in proportion to N, and the results tend toward generic, repetitive text, which is why beam search appears in machine translation far more than in chatbots.[2]

[Figure 2 diagram: GREEDY, one path: The (max) → cat (max) → sat (max); commit early, never reconsider. BEAM (N = 2), keep best sequences: The → cat / dog → sat / ran; explore two, keep the best total at the end.]
FIG. 2 — Greedy vs beam. Greedy locks onto the single most-probable token at every step; beam search keeps several candidate sequences in play and commits only at the end — more fluent, but heavier and often blander. Illustrative.
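
The beam procedure described above can be sketched as follows. The sketch assumes a made-up prefix-keyed probability table (TOY_MODEL) and a hypothetical beam_search helper; it scores sequences by summed log-probability and omits refinements such as length normalization and stop-token handling.

```python
import math

# Hypothetical model: next-token probabilities keyed by the prefix so far.
TOY_MODEL = {
    (): {"The": 1.0},
    ("The",): {"cat": 0.6, "dog": 0.4},
    ("The", "cat"): {"sat": 0.55, "ran": 0.45},
    ("The", "dog"): {"sat": 0.1, "ran": 0.9},
}


def beam_search(model, steps, beam_width):
    """Return the surviving beams as (log_probability, tokens), best first."""
    beams = [(0.0, ())]  # start from one empty sequence with log-prob 0
    for _ in range(steps):
        candidates = []
        for logp, tokens in beams:
            # Extend every live beam by every possible next token.
            for tok, p in model[tokens].items():
                candidates.append((logp + math.log(p), tokens + (tok,)))
        # Keep only the best `beam_width` partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


best_logp, best_tokens = beam_search(TOY_MODEL, steps=3, beam_width=2)[0]
print(best_tokens, round(math.exp(best_logp), 2))  # ('The', 'dog', 'ran') 0.36

# With a beam width of 1 the search reduces to greedy decoding.
print(beam_search(TOY_MODEL, steps=3, beam_width=1)[0][1])  # ('The', 'cat', 'sat')
```

Keeping two beams is enough here to recover the sequence that a single greedy path discards at the second step.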

Sampling & Temperature

Rather than always taking the top token, sampling draws at random from the distribution: high-probability tokens are the most likely picks, but the tail still gets a chance, which produces more diverse, creative text.[2] Temperature reshapes the distribution before the draw by dividing the logits by T ahead of the softmax: a low temperature sharpens it toward the favorites (more deterministic), a temperature of 1 leaves the model's own probabilities unchanged, and a high temperature flattens it toward the long tail (more random). As the temperature approaches 0 the distribution collapses onto the top token, so a setting of 0 is conventionally treated as greedy decoding.[3][2] Most modern LLM APIs generate by sampling by default rather than by greedy or beam search.[2]

[Figure 3 diagram: three panels over the same candidate tokens. Low temp ≈ 0.3: peaked, near-deterministic. Medium ≈ 1.0: the model's own odds. High temp ≈ 1.5: flattened, more random.]
FIG. 3 — Temperature. The same five candidate tokens, three temperatures. Low temperature concentrates probability on the favorites; high temperature spreads it toward the tail, trading reliability for diversity. Schematic.
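
The effect in Fig. 3 can be reproduced with a temperature-scaled softmax. This is a sketch under stated assumptions: the five logits are invented for illustration, and softmax_with_temperature is a hypothetical helper that rejects a temperature of 0, since dividing by zero is undefined and that setting is normally handled as a separate greedy case.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then normalize with a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 0.0, -1.0]  # hypothetical scores for five candidate tokens

for t in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    # Lower temperatures put more mass on the first (highest-scoring) token.
    print(t, [round(p, 3) for p in probs])
```

The ranking of the tokens never changes with temperature; only how much probability the favorites hold relative to the tail.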

Top-k & Top-p (Nucleus)

Pure sampling can occasionally pick an absurd tail token, so the tail is usually truncated first.[8] Top-k keeps the k most likely tokens, a fixed-size set, then renormalizes and samples.[8] Top-p, or nucleus sampling, instead keeps the smallest set of tokens whose cumulative probability reaches a threshold p, then renormalizes and samples from that set. The set is therefore dynamic: small when the model is confident and the distribution is peaked, larger when it is unsure.[6][5] It was introduced to address degenerate output: the repetitive text produced by likelihood-maximizing methods such as beam search and the incoherent text produced by sampling from the untruncated tail.[6]

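[Figure 4 diagram: sorted probabilities .40, .30, .15, .10, .03, .02. Top-k (k = 3): fixed count, keeps the first three. Top-p (p = 0.9): dynamic, adds the 4th token to reach 0.9 (cumulative 0.95). The tail is discarded, then renormalize and sample.]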
FIG. 4 — Two ways to trim the tail. On the sorted distribution, top-k keeps a fixed number of tokens; top-p keeps however many are needed to reach a cumulative probability — fewer when the model is sure, more when it isn't. Illustrative.
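
Both truncation rules can be sketched on the distribution from Fig. 4. The helper names (renormalize, top_k_filter, top_p_filter) are hypothetical, and the second, peaked distribution is invented here to show how the nucleus shrinks when the model is confident.

```python
def renormalize(probs, keep):
    """Zero every token outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep the k most likely tokens: a fixed-size set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def kept_count(probs):
    return sum(1 for x in probs if x > 0)


probs = [0.40, 0.30, 0.15, 0.10, 0.03, 0.02]     # the distribution from Fig. 4
peaked = [0.95, 0.02, 0.01, 0.01, 0.005, 0.005]  # hypothetical confident case

print(kept_count(top_k_filter(probs, k=3)))      # 3
print(kept_count(top_p_filter(probs, p=0.9)))    # 4
print(kept_count(top_k_filter(peaked, k=3)))     # 3
print(kept_count(top_p_filter(peaked, p=0.9)))   # 1
```

Top-k keeps three tokens in both cases, while top-p keeps four for the spread-out distribution and only one for the peaked one.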

The Tradeoff & Penalties

The whole space is one tradeoff: deterministic methods (greedy, beam) are coherent but bland; stochastic methods (sampling) are diverse but riskier; truncation (top-k/top-p) trims the unreliable tail to capture much of both.[8] A common add-on is a repetition or n-gram penalty, which lowers the probability of tokens that would repeat text already generated.[5]
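
One common way to apply a repetition penalty is to rescale the logits of tokens that have already appeared. The sketch below assumes that formulation and a penalty greater than 1; apply_repetition_penalty and its parameters are hypothetical names, and real libraries may differ in detail.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated tokens made less likely.

    Assumes penalty > 1. A positive score is divided by the penalty and a
    negative score is multiplied by it, so the score drops in both cases.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [3.0, 1.0, -0.5, 2.0]  # hypothetical scores for a four-token vocabulary
print(apply_repetition_penalty(logits, generated_ids=[0, 2, 0], penalty=1.5))
# [2.0, 1.0, -0.75, 2.0]
```

Tokens 0 and 2 have been generated already, so their scores fall; tokens 1 and 3 are untouched.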

03

How It All Fits Together

At each step the model's raw scores (logits) are rescaled by temperature, turned into probabilities by a softmax, optionally truncated by top-k or top-p and renormalized, and then one token is drawn. That token is appended to the input and the loop runs again.

[Figure 5 diagram: Model: Logits (raw scores) → Knob: Temperature (sharpen / flatten) → Norm: Softmax (→ probabilities) → Knob: Top-k / Top-p (trim the tail) → Choose: Sample (draw a token) → Loop: Token (append & repeat)]
FIG. 5 — The per-step decoding pipeline. Temperature reshapes, softmax normalizes, top-k/top-p truncates, and sampling makes the actual choice — then the token feeds back into the loop of Fig. 1.
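
The pipeline in Fig. 5 can be put together as a single per-step function. This is a sketch under stated assumptions, not any library's implementation: decode_step and its parameters are hypothetical, temperature 0 is special-cased as greedy, and when both top_k and top_p are set the top-p cutoff is computed on the original probabilities, a simplification that real implementations may handle differently.

```python
import math
import random


def decode_step(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token id from raw logits: temperature, softmax, truncate, sample."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Treated as greedy decoding: take the argmax directly.
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1. Temperature: sharpen or flatten the scores.
    scaled = [x / temperature for x in logits]

    # 2. Softmax: turn the scores into probabilities.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Truncate: keep a fixed count (top-k) and/or a cumulative mass (top-p).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # 4. Sample: random.choices uses relative weights, which renormalizes
    #    the surviving probabilities implicitly.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, 0.0, -1.0]  # hypothetical scores for five tokens
print(decode_step(logits, temperature=0))                   # 0
print(decode_step(logits, temperature=0.8, top_p=0.9,
                  rng=random.Random(0)))                    # depends on the seed
```

The returned id would then be appended to the input before the next forward pass, as in the loop of Fig. 1.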

The throughline: decoding is the bridge between the trained model and the text you actually read. It runs the very same next-token machinery as pretraining (Foundation 03), only with the weights frozen and the output sampled instead of scored against a target. The engineering of running this loop quickly — the KV cache, batching, speculative decoding — is the Inference & Serving discipline one tier up; how long the loop can run is bounded by the context window (Foundation 06); and these temperature and top-p settings are exactly the knobs an evaluation harness must pin down, because a model's measured quality depends on how it is sampled.

The settings are part of the result. Temperature and top-p are not cosmetic — they change behavior enough that the same model benchmarked at temperature 0 and at 0.8 can look like two different systems. Always report sampling settings, and fix them when comparing models.

Randomness defeats reproducibility. With sampling on, the same prompt can yield a different answer on every run, which is wonderful for creativity and painful for debugging, testing, or anything that needs a stable output. For those, temperature 0 (greedy) is the honest default.
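
The difference is easy to see on a fixed distribution. In the sketch below, PROBS, sample_run, and greedy_pick are hypothetical, and each run makes independent draws from one distribution rather than running a full generation loop. Fixing the random seed is shown as a second way to get repeatable draws in a local setup; whether a hosted API exposes a seed is an assumption that would need checking.

```python
import random

PROBS = [0.40, 0.30, 0.15, 0.10, 0.03, 0.02]  # hypothetical next-token distribution


def sample_run(seed=None, n=10):
    """Draw n token ids; with seed=None the generator is seeded unpredictably."""
    rng = random.Random(seed)
    return rng.choices(list(range(len(PROBS))), weights=PROBS, k=n)


def greedy_pick():
    """Temperature-0 behavior: always the single most likely token id."""
    return max(range(len(PROBS)), key=lambda i: PROBS[i])


print(sample_run())         # unseeded: may differ from run to run
print(sample_run())
print(sample_run(seed=42))  # seeded: the same draws every time
print(sample_run(seed=42))
print(greedy_pick())        # 0
```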

References

  1. Aman's AI Journal, “Token Sampling Methods”. https://aman.ai/primers/ai/token-sampling/
  2. MachineLearningPlus, “LLM Temperature, Top-P, and Top-K Explained”. https://machinelearningplus.com/gen-ai/llm-temperature-top-p-top-k-explained/
  3. phData, “How to Tune LLM Parameters: Temperature, Top K, and Top P”. https://www.phdata.io/blog/how-to-tune-llm-parameters-for-top-performance-understanding-temperature-top-k-and-top-p/
  4. Towards Data Science, “Decoding Strategies in Large Language Models”. https://towardsdatascience.com/decoding-strategies-in-large-language-models-9733a8f70539/
  5. Medium · F. Chiusano, “Most Used Decoding Methods for Language Models”. https://medium.com/nlplanet/two-minutes-nlp-most-used-decoding-methods-for-language-models-9d44b2375612
  6. Wikipedia, “Top-p sampling (nucleus sampling)”. https://en.wikipedia.org/wiki/Top-p_sampling
  7. Medium · A. M. B, “How Top-k, Top-p, Temperature, and Beam Search Shape Text Generation”. https://medium.com/@ansilproabl/from-randomness-to-precision-how-top-k-top-p-temperature-and-beam-search-shape-text-generation-d1f50b5220e2
  8. arXiv, “Foundations of Top-k Decoding For Language Models”. https://arxiv.org/pdf/2505.19371