This note covers the inference and serving discipline: the engineering of turning a trained model's frozen weights into tokens at acceptable speed and cost in production. It spans the two inference phases (prefill and decode), the KV cache, the efficiency levers (quantization, batching, speculative decoding), the user-facing latency metrics versus system throughput, and the fundamental tradeoff between them. It excludes how the model is built or trained, and what it is used for — this is the layer underneath all of that: every prompt, retrieval, agent step, and eval run ultimately resolves to an inference call.
Training a model happens once; serving it happens forever. That is why inference, not training, is the overwhelming majority of an LLM system's operational cost — and a single efficiency gain compounds across every request, every session, and every agentic loop.[3] The naive way to serve is wasteful: process one request at a time, and the expensive GPU sits mostly idle while everything else waits.[2]
The striking part is the size of the gap. Moving from naive to optimized serving is worth roughly 10–20× in throughput and 5–10× in cost — wider than the gap between whole GPU generations.[1][3] So the discipline is not about buying bigger hardware; it is about squeezing the most useful tokens per dollar and per second out of fixed hardware, without degrading the output.
To avoid recomputing attention over the entire sequence at every step, the model caches the key and value vectors of every token it has already processed: the KV cache.[6] It is a huge speedup, but it has a cost that bites: the cache grows linearly with batch size times sequence length, so for long contexts and many concurrent requests it becomes the memory bottleneck of the whole serving system.[6] Techniques like PagedAttention manage it like virtual memory, handing out non-contiguous blocks on demand instead of pre-allocating one big contiguous slab per request.[2]
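The linear growth is easy to put numbers on. The sketch below is a back-of-the-envelope estimate, not a measurement: the model shape (32 layers, 32 KV heads, head dimension 128) and the 2-byte cache precision are hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value=2):
    """Bytes needed to cache keys and values for a batch of sequences."""
    # Factor of 2: one key vector and one value vector
    # per token, per layer, per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

one = kv_cache_bytes(**shape, seq_len=4096, batch_size=1)
many = kv_cache_bytes(**shape, seq_len=4096, batch_size=16)

print(one / 2**30, "GiB for one 4096-token sequence")    # 2.0
print(many / 2**30, "GiB for sixteen of them")           # 32.0
```

Under these assumptions each cached token costs 0.5 MiB, so sixteen concurrent 4096-token requests need 32 GiB for the cache alone, before counting the weights.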
Quantization stores the weights, and sometimes activations and the KV cache, at lower numerical precision, dropping from 16-bit down to 8-bit (INT8 or FP8) or 4-bit.[5] Because token generation is limited by how fast data moves out of GPU memory, moving fewer bits directly speeds it up, and the smaller footprint is enough to run a 70B model on a consumer GPU.[1][4] The tradeoff is blunt: lower precision can degrade output quality, so the right bit-width is an empirical question.[4]
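To make the footprint-versus-precision trade concrete, here is a minimal sketch of the simplest scheme, symmetric round-to-nearest quantization to 8-bit integers with one scale per tensor. It is an illustration only: production methods are more sophisticated, and the parameter count and weight values below are hypothetical.

```python
def weight_gib(num_params, bits_per_weight):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30

def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [x * scale for x in q]

# Hypothetical 7-billion-parameter model: footprint shrinks with bit-width.
for bits in (16, 8, 4):
    print(bits, "bits:", round(weight_gib(7e9, bits), 1), "GiB")

# Hypothetical weight values: the round trip is close but not exact.
weights = [0.02, -0.5, 0.13, 1.27, -0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, "max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, which is the quality cost in miniature. Whether that error matters for a given model and task is the empirical question.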
Serving many requests at once is how you actually utilize an expensive GPU.[4] Static batching waits for every request in a batch to finish before admitting new ones, so short requests stall behind long ones. Continuous (in-flight) batching slots a new request in the moment any sequence completes, keeping utilization near full.[3] The payoff is large: batching dozens of requests together can cut per-token cost by roughly 85%, which is why continuous batching is considered indispensable for online services.[3][4]
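The difference between the two policies shows up even in a toy simulation. The sketch below assumes a GPU with a fixed number of batch slots, one token per active sequence per decode step, and no prefill cost. The request lengths are hypothetical, and this is not how any particular serving engine is implemented.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Total decode steps when each batch runs until its longest request ends."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total

def continuous_batching_steps(lengths, slots):
    """Total decode steps when a freed slot is refilled before the next step."""
    queue = deque(lengths)
    active = []   # tokens still to generate for each running sequence
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # One decode step: every active sequence emits one token,
        # and finished sequences leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps

# Hypothetical workload: two long requests mixed with six short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
print("static:", static_steps, "continuous:", continuous_steps)
```

With two slots, static batching needs 210 steps because each short request waits out its long neighbour. Continuous batching needs 115, which is the minimum possible for 230 tokens on two slots.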
In speculative decoding, a small, fast draft model proposes several tokens ahead; the large target model then verifies them in a single parallel pass and keeps the matching prefix.[9] It is lossless, because only tokens the target itself would have produced are accepted, and it attacks the sequential bottleneck of decode. But it does not touch prefill, so it leaves time-to-first-token essentially unchanged, and its payoff depends heavily on how predictable the output is: highly structured text accelerates a lot, open-ended generation barely at all.[10]
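The propose-then-verify loop can be sketched for the greedy case, where "matching" means the draft token equals the target's top choice. Sampling-based variants use a probabilistic acceptance rule instead. Everything here is a toy: `draft_next` and `target_next` are hypothetical stand-ins for real models, and the verification loop stands in for what would be one batched forward pass.

```python
def speculative_step(context, draft_next, target_next, k):
    """Propose k draft tokens; keep the prefix the target agrees with."""
    # Draft phase: the small model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))
    # Verify phase: compare each proposed token with the target's choice.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: take the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # All k matched: the same target pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted

def speculative_generate(prompt, draft_next, target_next, k, max_new):
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[len(prompt):len(prompt) + max_new]

def greedy_generate(prompt, target_next, max_new):
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out[len(prompt):]

# Toy models: the target counts upward modulo 10; the draft agrees
# except after a 4, where it guesses wrong.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

spec = speculative_generate([0], draft_next, target_next, k=3, max_new=12)
plain = greedy_generate([0], target_next, max_new=12)
print(spec == plain)
```

The output matches plain greedy decoding token for token, which is the lossless property. The speedup comes only from how many draft tokens survive verification per target pass.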
Three numbers govern serving: time-to-first-token (TTFT), time-per-output-token (TPOT), and throughput. The first two are what the user feels; the third is what the operator pays for.[8]
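Taking the three numbers to be time-to-first-token (TTFT), time-per-output-token (TPOT), and throughput, all of them fall out of a request log that records when each request arrived and when each token was emitted. The sketch below assumes that log format; the function names and timestamps are hypothetical.

```python
def latency_metrics(request_time, token_times):
    """TTFT and TPOT for one request, from wall-clock timestamps in seconds."""
    # TTFT: how long the user waits before anything appears.
    ttft = token_times[0] - request_time
    # TPOT: average gap between successive tokens after the first.
    n = len(token_times)
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot

def throughput(requests):
    """Output tokens per second across all requests, over the whole window."""
    start = min(req_time for req_time, _ in requests)
    end = max(times[-1] for _, times in requests)
    total_tokens = sum(len(times) for _, times in requests)
    return total_tokens / (end - start)

# Hypothetical log: (arrival time, [emission time of each output token]).
log = [
    (0.0, [0.5, 0.75, 1.0, 1.25, 1.5]),
    (0.5, [1.0, 1.5, 2.0]),
]

for arrival, times in log:
    print(latency_metrics(arrival, times))   # (0.5, 0.25) then (0.5, 0.5)
print(throughput(log), "tokens/s")           # 4.0
```

TTFT and TPOT are per-request quantities, while throughput is a property of the whole system over a time window. That is why one can improve while the others get worse.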
The hard part is that latency and throughput pull against each other. Batching raises throughput but interleaves heavy prefills with light decodes, regressing both TTFT and TPOT; and past a certain batch size the system goes compute-bound, where doubling the batch only adds latency without buying more throughput.[7][4]
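The compute-bound ceiling can be illustrated with a deliberately crude model in which each decode step takes as long as the slower of two things: loading the weights from memory (a fixed cost) or doing the arithmetic (a cost that grows with batch size). Both constants below are hypothetical, and the model ignores KV cache traffic and prefill interference.

```python
def step_time_ms(batch, memory_ms=20.0, compute_ms_per_seq=0.5):
    """Time for one decode step: the slower of weight loading and arithmetic."""
    return max(memory_ms, compute_ms_per_seq * batch)

def tokens_per_second(batch, **kw):
    """System throughput: one token per sequence per step."""
    return batch * 1000.0 / step_time_ms(batch, **kw)

for batch in (1, 10, 40, 80, 160):
    print(batch, step_time_ms(batch), "ms/step",
          tokens_per_second(batch), "tok/s")
```

With these constants the crossover sits at a batch of 40. Below it, adding requests is nearly free: step time stays at 20 ms while throughput climbs. Above it, going from 40 to 80 doubles the per-token latency from 20 ms to 40 ms while throughput stays flat at 2000 tokens per second.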
A single request walks a fixed path: the prompt is read in a prefill pass, its keys and values land in the KV cache, then the decode loop emits tokens one at a time, reusing and extending that cache until it stops.
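That path can be traced in a toy loop. Nothing below is a real model: `toy_kv` and `toy_next_token` are hypothetical stand-ins for the key/value projections and for attention plus sampling, kept only to show where the cache is filled and where it is extended.

```python
EOS = 0  # hypothetical end-of-sequence token id

def toy_kv(token):
    """Stand-in for computing one token's key and value vectors."""
    return (token * 2, token * 3)

def toy_next_token(cache):
    """Stand-in for attention + sampling: reads the whole cache, picks a token."""
    return sum(k + v for k, v in cache) % 7

def serve(prompt, max_new_tokens):
    # Prefill: every prompt token is processed in one pass and cached.
    cache = [toy_kv(t) for t in prompt]
    token = toy_next_token(cache)   # the first output token ends prefill
    output = [token]
    # Decode: one token per step; each step appends a single cache entry
    # and reuses everything already cached.
    while token != EOS and len(output) < max_new_tokens:
        cache.append(toy_kv(token))
        token = toy_next_token(cache)
        output.append(token)
    return output, len(cache)

output, cache_len = serve([1, 2, 3], max_new_tokens=5)
print(output, cache_len)
```

The structural point is that prefill touches all prompt tokens at once, while decode adds exactly one cache entry per emitted token, so the cache ends at prompt length plus output length minus one.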
The throughline: prefill sets how quickly you answer, decode sets how quickly you stream, the KV cache sets your memory ceiling, and batching sets how many users share the GPU — and every optimization is a move on one of those, paid for somewhere else. This discipline sits beneath the whole roadmap: each RAG retrieval-then-generate, each step of an agent's loop, and each run of an evaluation suite is one or more inference calls, so the cost and latency here set the economics of everything built on top. It also closes a loop with adaptation — a distilled or quantized smaller model is precisely the lever that makes serving cheap, which is why "make it smaller" and "serve it efficiently" are two halves of the same cost story.
Every optimization is a tradeoff, and only your workload settles it. Quantization can erode quality; batching lifts throughput but can hurt any one user's latency; speculative decoding speeds decode but adds nothing to TTFT and only pays off on predictable output. Benchmark on your own traffic, not a vendor's demo.
Optimize the metric that matches the application. Chasing throughput for an interactive chatbot, or tight latency for an overnight batch job, is optimizing the wrong thing. Decide whether you are latency-bound or throughput-bound first, then tune toward it.