Research Note  ·  Applied ML / Serving

Inference & Serving for AI Systems: Latency, Throughput, and the Cost of Every Token

Roadmap · Discipline 05 of the LLM Systems series
TL;DR — Scope

This note covers the inference and serving discipline: the engineering of turning a trained model's frozen weights into tokens at acceptable speed and cost in production. It spans the two inference phases (prefill and decode), the KV cache, the efficiency levers (quantization, batching, speculative decoding), the user-facing latency metrics versus system throughput, and the fundamental tradeoff between them. It excludes how the model is built or trained, and what it is used for — this is the layer underneath all of that: every prompt, retrieval, agent step, and eval run ultimately resolves to an inference call.

01

The Problem

Training a model happens once; serving it happens forever. That is why inference, not training, is the overwhelming majority of an LLM system's operational cost — and a single efficiency gain compounds across every request, every session, and every agentic loop.[3] The naive way to serve is wasteful: process one request at a time, and the expensive GPU sits mostly idle while everything else waits.[2]

The striking part is the size of the gap. Moving from naive to optimized serving is worth roughly 10–20× in throughput and a 5–10× reduction in cost, wider than the gap between whole GPU generations.[1][3] So the discipline is not about buying bigger hardware; it is about squeezing the most useful tokens per dollar and per second out of fixed hardware, without degrading the output.

[Diagram: PREFILL (compute-bound) processes prompt tokens t1 to t4 in one parallel pass and produces the first token, setting TTFT. DECODE (memory-bandwidth-bound) emits tokens t5 to t8 one at a time, each needing the last, setting TPOT (the per-token rate).]
FIG. 1 — Two phases, opposite bottlenecks. Prefill chews the whole prompt at once (compute-bound) and fixes how long until the first token appears. Decode then trickles out tokens one by one (memory-bandwidth-bound), setting the streaming speed. Almost every optimization targets one phase or the other.
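
The two bottlenecks can be put into rough numbers. The sketch below is a back-of-the-envelope estimate, not a benchmark: it assumes about 2 FLOPs per parameter per prompt token for prefill (attention cost ignored) and one full read of the weights per generated token for decode. The model size and hardware figures are hypothetical, chosen only for illustration.

```python
def prefill_seconds(n_params: float, prompt_tokens: int, peak_flops: float) -> float:
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    return 2.0 * n_params * prompt_tokens / peak_flops


def decode_seconds_per_token(n_params: float, bytes_per_param: float,
                             mem_bandwidth: float) -> float:
    """Bandwidth-bound estimate: every weight is read once per generated token."""
    return n_params * bytes_per_param / mem_bandwidth


# Hypothetical numbers, for illustration only.
N_PARAMS = 7e9          # a 7B-parameter model
PEAK_FLOPS = 300e12     # assumed usable compute, FLOP/s
MEM_BANDWIDTH = 2e12    # assumed memory bandwidth, bytes/s

ttft_estimate = prefill_seconds(N_PARAMS, prompt_tokens=1000, peak_flops=PEAK_FLOPS)
tpot_estimate = decode_seconds_per_token(N_PARAMS, bytes_per_param=2.0,
                                         mem_bandwidth=MEM_BANDWIDTH)
print(f"prefill ~{ttft_estimate * 1e3:.1f} ms, decode ~{tpot_estimate * 1e3:.1f} ms/token")
```

Under these assumptions, prefill time scales with prompt length and compute, while decode time per token scales with model bytes and bandwidth, which is why the two phases respond to different optimizations.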
02

The Concepts

The KV Cache

To avoid recomputing attention over the entire sequence at every step, the model caches the key and value vectors of all the tokens it has already processed.[6] It is a huge speedup — but it has a cost that bites: the cache grows linearly with batch size times sequence length, so for long contexts and many concurrent requests it becomes the memory bottleneck of the whole serving system.[6] Techniques like PagedAttention manage it like virtual memory, handing out non-contiguous blocks instead of pre-allocating one big slab.[2]
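
The virtual-memory idea can be illustrated with a toy block allocator. This is a minimal sketch of paging in general, not the PagedAttention implementation: the class name, method names, and block size are all hypothetical, and no actual tensors are stored.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy allocator: each sequence maps to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token; take a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # every owned block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())    # blocks need not be contiguous
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("request-a")
print(len(alloc.block_tables["request-a"]), "blocks in use")
```

Memory is claimed one block at a time as the sequence grows, so a short request never holds space sized for the longest possible one.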

[Plot: KV memory against sequence length, with lines for batch = 1 and batch = N (steeper). Linear growth; the system's memory ceiling.]
FIG. 2 — Why the KV cache is the bottleneck. Its footprint climbs linearly with how long the conversation is and how many requests you batch together — which is exactly why long context and high concurrency are expensive.
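
The linear growth follows from simple arithmetic: each cached token stores one key and one value vector per layer. The sketch below assumes a standard multi-head layout and 16-bit values; the model shape is hypothetical and chosen only to make the numbers round.

```python
def kv_cache_bytes(batch_size: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes for keys and values across all layers; linear in batch_size * seq_len."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return batch_size * seq_len * per_token


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)
one = kv_cache_bytes(batch_size=1, seq_len=4096, **shape)
many = kv_cache_bytes(batch_size=16, seq_len=4096, **shape)
print(f"batch 1: {one / 2**30:.1f} GiB, batch 16: {many / 2**30:.1f} GiB")
# prints "batch 1: 2.0 GiB, batch 16: 32.0 GiB"
```

Doubling either the batch size or the sequence length doubles the footprint, which is the slope difference between the two lines in the figure.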

Quantization

Store the weights, and sometimes activations and the KV cache, at lower numerical precision, dropping from 16-bit down to 8-bit (INT8 or FP8) or 4-bit.[5] Because generation is limited by how fast data moves out of GPU memory, moving fewer bits directly speeds up token generation. It also shrinks the footprint: a 70B model's weights take roughly 140 GB at 16-bit but about 35 GB at 4-bit, which brings the model within reach of high-end consumer hardware.[1][4] The tradeoff is blunt: lower precision can degrade output quality, so the right bit-width is an empirical question.[4]
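
To make the precision tradeoff concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in pure Python. It is one simple scheme among many, assumed here for illustration; production systems typically use per-channel or per-group scales and calibrated methods. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max round-trip error {max_error:.5f} (bound {scale / 2:.5f})")
```

Each weight now occupies one byte instead of two, and the round-trip error is bounded by half the scale. Whether that error matters for output quality is the empirical question the paragraph above describes.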

Batching

Serving many requests at once is how you actually utilize an expensive GPU.[4] Static batching waits for every request in a batch to finish before admitting new ones, so short requests stall behind long ones. Continuous (in-flight) batching slots a new request in the moment any sequence completes, keeping utilization near full.[3] The payoff is large: batching dozens of requests together can cut per-token cost by roughly 85%, which is why continuous batching is considered indispensable for online services.[3][4]

[Diagram: STATIC batching waits for the batch to finish: requests A, B, and C run together and the GPU idles until the slowest finishes. CONTINUOUS batching fills the slot instantly: as B and C finish, new requests D, E, and F slot in and the GPU stays full.]
FIG. 3 — Static vs continuous batching. Under static batching the GPU idles (dashed) until the longest request in the batch finishes; continuous batching admits a fresh request the instant a slot opens, keeping the hardware busy and slashing per-token cost.
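
The difference can be simulated in a few lines. This sketch is a simplified model, not a scheduler: it assumes every running request emits exactly one token per decode step, ignores prefill cost, and admits requests in arrival order. The function names are hypothetical.

```python
from typing import List


def static_batching_steps(output_lengths: List[int], slots: int) -> int:
    """Decode steps when each batch must fully finish before the next is admitted."""
    total = 0
    for i in range(0, len(output_lengths), slots):
        total += max(output_lengths[i:i + slots])  # a batch lasts as long as its slowest
    return total


def continuous_batching_steps(output_lengths: List[int], slots: int) -> int:
    """Decode steps when a waiting request fills a slot the moment one frees up."""
    waiting = list(output_lengths)
    running: List[int] = []          # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.pop(0))
        steps += 1
        running = [r - 1 for r in running if r > 1]  # drop requests that just finished
    return steps


lengths = [8, 2, 2, 8, 2, 2]         # tokens each request will generate
print(static_batching_steps(lengths, slots=3))      # 16
print(continuous_batching_steps(lengths, slots=3))  # 10
```

With the same six requests and three slots, the static schedule needs 16 steps because the short requests wait behind the long ones, while the continuous schedule finishes in 10.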

Speculative Decoding

A small, fast draft model proposes several tokens ahead; the large target model then verifies them in a single parallel pass and keeps the matching prefix.[9] It is lossless — only tokens the target itself would have produced are accepted — and it attacks the sequential bottleneck of decode. But it does not touch prefill, so it leaves time-to-first-token essentially unchanged, and its payoff depends heavily on how predictable the output is: highly structured text accelerates a lot, open-ended generation barely at all.[10]

[Diagram: a small, fast draft model proposes tokens d1 to d4 ahead; the target model verifies all of them in one pass and keeps the matching prefix.]
FIG. 4 — Speculative decoding. The cheap draft model races ahead with guesses; the expensive target model checks them all at once and commits the run that matches. Several tokens for the price of roughly one target step — when the guesses are good.
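
The propose-then-verify loop can be sketched with toy models. This is a minimal greedy variant under stated assumptions: both "models" are plain deterministic functions from a token list to the next token, and the target's parallel verification pass is simulated by a loop and counted as a single pass. Sampling-based acceptance rules are not shown, and all names are hypothetical.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (new tokens, number of target passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1. The draft proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real system does this in
        #    one batched forward pass; here it is a loop counted as one pass.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(out + proposal[:i])
            accepted.append(expected)          # the target's own token is always kept
            if i == k or proposal[i] != expected:
                break                          # stop at the first disagreement
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new_tokens], target_passes


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10                  # counts upward, wrapping at 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10   # wrong after a 7


tokens, passes = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12)
print(tokens, passes)
```

Because every emitted token is the target's own choice, the output matches plain greedy decoding with the target alone. The gain is the number of target passes: 3 here instead of 12, and it shrinks as the draft disagrees more often.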

Metrics & the Fundamental Tradeoff

Three numbers govern serving. They split into what the user feels and what the operator pays for.[8]

TTFT (user-felt): time to first token; perceived responsiveness
TPOT (user-felt): time per output token; how smoothly text streams
TPS (operator): tokens per second; total capacity and cost-efficiency
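
The three metrics can be computed from per-request timestamps. The sketch below assumes one common convention: TPOT is the mean gap between output tokens after the first, and TPS counts all output tokens over the wall-clock window. The `RequestTrace` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    arrival: float             # seconds, when the request was received
    token_times: List[float]   # seconds, when each output token was emitted


def ttft(trace: RequestTrace) -> float:
    """Time to first token."""
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Mean time per output token after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def tps(traces: List[RequestTrace]) -> float:
    """System throughput: all output tokens over the wall-clock window."""
    start = min(t.arrival for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return sum(len(t.token_times) for t in traces) / (end - start)


a = RequestTrace(arrival=0.0, token_times=[0.5, 0.6, 0.7, 0.8, 0.9])
b = RequestTrace(arrival=1.0, token_times=[1.5, 2.0])
print(f"TTFT {ttft(a):.2f} s, TPOT {tpot(a):.2f} s, TPS {tps([a, b]):.2f}")
```

TTFT and TPOT are per-request quantities, so they are usually reported as percentiles across requests, whereas TPS is a single system-wide number.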

The hard part is that latency and throughput pull against each other. Batching raises throughput, but it interleaves compute-heavy prefills with lightweight decode steps, so queued requests wait longer for their first token (worse TTFT) and in-flight streams stall behind each prefill (worse TPOT). Past a certain batch size the system also becomes compute-bound, where doubling the batch adds latency without buying more throughput.[7][4]

[Plot: throughput (TPS) and latency against batch size / load. The knee marks where the system becomes compute-bound: beyond it, more batch adds only latency, not throughput.]
FIG. 5 — The latency–throughput tradeoff. Adding load lifts throughput until it plateaus, but latency keeps climbing — and worsens past the knee. There is no single "fast" setting; you tune toward whichever metric your application lives or dies by. Schematic.
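
The knee can be reproduced with a two-term cost model. This is a deliberately crude sketch: it assumes one decode step costs the larger of a fixed weight-read time and a compute time that grows linearly with batch size, and it ignores KV cache reads, queueing, and prefill interference (so latency stays flat before the knee here, unlike in a real system). Both timing constants are hypothetical.

```python
def decode_step_ms(batch_size: int, weight_read_ms: float = 8.0,
                   compute_per_seq_ms: float = 0.25) -> float:
    """One decode step costs whichever is larger: reading weights or doing the math."""
    return max(weight_read_ms, batch_size * compute_per_seq_ms)


def throughput_tps(batch_size: int) -> float:
    """Tokens per second when every sequence in the batch emits one token per step."""
    return batch_size * 1000.0 / decode_step_ms(batch_size)


for batch in (1, 16, 32, 64, 128):
    print(f"batch {batch:>3}: step {decode_step_ms(batch):5.1f} ms, "
          f"throughput {throughput_tps(batch):6.0f} tok/s")
```

With these constants the knee sits at batch 32. Below it, each doubling of the batch doubles throughput at the same step time; above it, throughput stays at 4000 tok/s while the step time, and so per-token latency, doubles.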
03

How It All Fits Together

A single request walks a fixed path: the prompt is read in a prefill pass, its keys and values land in the KV cache, then the decode loop emits tokens one at a time, reusing and extending that cache until it stops.

In · Prompt: tokens arrive
Phase 1 · Prefill: parallel pass · TTFT
Store · KV cache: keys/values held
Phase 2 · Decode loop: one token at a time · TPOT
Out · Tokens: streamed back
FIG. 6 — The request lifecycle. Quantization shrinks the weights this path moves, the KV cache is what decode reuses, batching overlaps many of these paths on one GPU, and speculative decoding shortcuts the decode loop. Every lever in this note plugs into one of these stages.
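
The lifecycle maps onto a short generation loop. The sketch below uses a fake model purely to show the control flow: `toy_model` is a hypothetical stand-in that "caches" token ids in a list and predicts the last token plus one, whereas a real model would cache key and value tensors per layer.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: takes new token ids and the cache so far,
# returns the next token id and the extended cache.
ModelStep = Callable[[List[int], List[int]], Tuple[int, List[int]]]


def toy_model(new_tokens: List[int], kv_cache: List[int]) -> Tuple[int, List[int]]:
    """Fake forward pass: caches the tokens it sees and predicts last + 1."""
    kv_cache = kv_cache + new_tokens       # stand-in for appending keys/values
    return kv_cache[-1] + 1, kv_cache


def generate(model: ModelStep, prompt: List[int], max_new_tokens: int,
             eos: int) -> List[int]:
    # Phase 1, prefill: the whole prompt goes through in one call (sets TTFT).
    next_token, kv_cache = model(prompt, [])
    output = [next_token]
    # Phase 2, decode: feed back one token per step, reusing the cache (sets TPOT).
    while len(output) < max_new_tokens and next_token != eos:
        next_token, kv_cache = model([next_token], kv_cache)
        output.append(next_token)
    return output


print(generate(toy_model, prompt=[1, 2, 3], max_new_tokens=5, eos=6))  # [4, 5, 6]
```

The prompt is passed to the model once; every later call passes only the single newest token, which is the saving the KV cache provides.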

The throughline: prefill sets how quickly you answer, decode sets how quickly you stream, the KV cache sets your memory ceiling, and batching sets how many users share the GPU — and every optimization is a move on one of those, paid for somewhere else. This discipline sits beneath the whole roadmap: each RAG retrieval-then-generate, each step of an agent's loop, and each run of an evaluation suite is one or more inference calls, so the cost and latency here set the economics of everything built on top. It also closes a loop with adaptation — a distilled or quantized smaller model is precisely the lever that makes serving cheap, which is why "make it smaller" and "serve it efficiently" are two halves of the same cost story.

Every optimization is a tradeoff, and only your workload settles it. Quantization can erode quality; batching lifts throughput but can hurt any one user's latency; speculative decoding speeds decode but adds nothing to TTFT and only pays off on predictable output. Benchmark on your own traffic, not a vendor's demo.

Optimize the metric that matches the application. Chasing throughput for an interactive chatbot, or tight latency for an overnight batch job, is optimizing the wrong thing. Decide whether you are latency-bound or throughput-bound first, then tune toward it.

References

  1. PR-Peri. LLM Inference Optimization: Quantization, KV Cache, and Serving at Scale. https://pr-peri.github.io/blogpost/2026/03/25/blogpost-llm-quantization-kv-cache.html
  2. Runpod. LLM inference optimization: techniques that actually reduce latency and cost. https://www.runpod.io/blog/llm-inference-optimization-techniques-reduce-latency-cost
  3. Morph. LLM Inference: Prefill, Decode, KV Cache & Cost Guide (2026). https://www.morphllm.com/llm-inference
  4. Databricks. LLM Inference Performance Engineering: Best Practices. https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
  5. Clarifai. LLM Inference Optimization Techniques. https://www.clarifai.com/blog/llm-inference-optimization/
  6. arXiv · KVTuner. KVTuner: Sensitivity-Aware Mixed-Precision KV Cache Quantization. https://arxiv.org/pdf/2502.04420
  7. Medium · J. Ray. Throughput-Latency tradeoff in LLM Inference, Part II. https://medium.com/better-ml/throughput-latency-tradeoff-in-llm-inference-part-ii-6fa67d975aaa
  8. Medium · LearnWithNK. Decoding Real-Time LLM Inference: Latency vs. Throughput. https://medium.com/learnwithnk/decoding-real-time-llm-inference-a-guide-to-the-latency-vs-throughput-bottleneck-c1ad96442d50
  9. arXiv. Decoding Speculative Decoding. https://arxiv.org/pdf/2402.01528
  10. AWS. Accelerating decode-heavy LLM inference with speculative decoding. https://aws.amazon.com/blogs/machine-learning/accelerating-decode-heavy-llm-inference-with-speculative-decoding-on-aws-trainium-and-vllm/