Inference & Serving for AI Systems: Latency, Throughput, and the Cost of Every Token

TL;DR — Scope

This note covers the inference and serving discipline: the engineering of turning a trained model's frozen weights into tokens at acceptable speed and cost in production. It spans the two inference phases (prefill and decode), the KV cache, the efficiency levers (quantization, batching, speculative decoding), the user-facing latency metrics versus system throughput, and the fundamental tradeoff between them. It excludes how the model is built or trained, and what it is used for — this is the layer underneath all of that: every prompt, retrieval, agent step, and eval run ultimately resolves to an inference call.

The Problem

Training a model happens once; serving it happens forever. That is why inference, not training, is the overwhelming majority of an LLM system's operational cost — and a single efficiency gain compounds across every request, every session, and every agentic loop.^[3] The naive way to serve is wasteful: process one request at a time, and the expensive GPU sits mostly idle while everything else waits.^[2]

The striking part is the size of the gap. Moving from naive to optimized serving is worth roughly 10–20× in throughput and 5–10× in cost — wider than the gap between whole GPU generations.^[1]^[3] So the discipline is not about buying bigger hardware; it is about squeezing the most useful tokens per dollar and per second out of fixed hardware, without degrading the output.

FIG. 1 — Two phases, opposite bottlenecks. Prefill chews the whole prompt at once (compute-bound) and fixes how long until the first token appears. Decode then trickles out tokens one by one (memory-bandwidth-bound), setting the streaming speed. Almost every optimization targets one phase or the other.
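One rough way to see why decode is memory-bandwidth-bound: every decode step has to stream the model's weights out of GPU memory to produce a single new token per sequence. The sketch below turns that into a back-of-envelope ceiling on single-sequence decode speed. The parameter count, precision, and bandwidth figures are hypothetical placeholders rather than measurements, and the estimate ignores KV-cache reads and compute time.

```python
def decode_tokens_per_second_ceiling(n_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper bound on single-sequence decode speed when weight reads dominate.

    Each decode step reads every weight once, so the step cannot finish
    faster than (weight bytes / memory bandwidth).
    """
    weight_bytes = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes


# Hypothetical numbers: a 70B-parameter model on a GPU with 2 TB/s of bandwidth.
ceiling_fp16 = decode_tokens_per_second_ceiling(70e9, 2.0, 2e12)  # 16-bit weights
ceiling_int4 = decode_tokens_per_second_ceiling(70e9, 0.5, 2e12)  # 4-bit weights

print(f"16-bit ceiling: {ceiling_fp16:.1f} tokens/s")
print(f" 4-bit ceiling: {ceiling_int4:.1f} tokens/s")
```

Under these assumptions the ceiling scales inversely with bytes per parameter, which is the arithmetic behind both quantization and batching: either move fewer bytes per step, or share one pass over the weights across many sequences.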

The Concepts

The KV Cache

To avoid recomputing attention over the entire sequence at every step, the model caches the key and value vectors of all the tokens it has already processed.^[6] It is a huge speedup — but it has a cost that bites: the cache grows linearly with batch size times sequence length, so for long contexts and many concurrent requests it becomes the memory bottleneck of the whole serving system.^[6] Techniques like PagedAttention manage it like virtual memory, handing out non-contiguous blocks instead of pre-allocating one big slab.^[2]
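The block-based idea can be illustrated with a toy allocator. This is a minimal sketch of the general paging concept, not the PagedAttention implementation: the class name `PagedKVAllocator`, its methods, and the block counts are all hypothetical, and it tracks only which blocks a sequence owns, not the key/value tensors themselves.

```python
class PagedKVAllocator:
    """Toy block table: each sequence owns a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=4)
for _ in range(4):
    alloc.append_token("a")  # fills one block
for _ in range(3):
    alloc.append_token("b")  # takes the next block
alloc.append_token("a")      # "a" grows into a block that is not adjacent to its first

print(alloc.block_tables)      # "a" holds two non-contiguous blocks
alloc.free("a")
print(len(alloc.free_blocks))  # blocks are reusable immediately
```

The design point is that memory is claimed one block at a time as a sequence grows, so nothing is reserved up front for tokens that may never be generated.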

FIG. 2 — Why the KV cache is the bottleneck. Its footprint climbs linearly with how long the conversation is and how many requests you batch together — which is exactly why long context and high concurrency are expensive.
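The linear growth can be made concrete with the standard size estimate for a transformer KV cache: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model shape used here (32 layers, 32 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for round numbers, not one taken from the text.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 counts one key tensor and one value tensor per layer.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


GIB = 1024 ** 3
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)  # hypothetical model shape

for batch_size, seq_len in [(1, 4096), (16, 4096), (16, 32768)]:
    size = kv_cache_bytes(batch_size, seq_len, **shape)
    print(f"batch={batch_size:>2} seq_len={seq_len:>5}: {size / GIB:.0f} GiB")
```

With this shape each token costs 0.5 MiB, so a single 4,096-token sequence needs 2 GiB, and doubling either the batch size or the sequence length doubles the total.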

Quantization

Store the weights, and sometimes the activations and the KV cache, at lower numerical precision: from 16-bit down to 8-bit (integer or FP8) or 4-bit.^[5] Because generation is limited by how fast data moves out of GPU memory, moving fewer bits directly speeds up token generation and shrinks the footprint enough to run a 70B model on a consumer GPU.^[1]^[4] The tradeoff is blunt: lower precision can degrade output quality, so the right bit-width is an empirical question.^[4]
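To show where the quality loss comes from, here is a minimal sketch of the simplest form of weight quantization: symmetric, per-tensor, round-to-nearest int8. It is an illustration under stated assumptions, not what any particular serving stack does; production schemes typically use finer-grained scales (per channel or per group). The function names and the sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


weights = [0.02, -0.5, 0.31, 1.27, -1.0]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each value is stored in one byte instead of two, and the round-trip error is bounded by half the scale. A single large outlier inflates the scale and therefore the error on every other weight, which is one reason the acceptable bit-width has to be checked empirically.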

Batching

Serving many requests at once is how you actually utilize an expensive GPU.^[4] Static batching waits for every request in a batch to finish before admitting new ones, so short requests stall behind long ones. Continuous (in-flight) batching slots a new request in the moment any sequence completes, keeping utilization near full.^[3] The payoff is large: batching dozens of requests together can cut per-token cost by roughly 85%, which is why continuous batching is considered indispensable for online services.^[3]^[4]

FIG. 3 — Static vs continuous batching. Under static batching the GPU idles (dashed) until the longest request in the batch finishes; continuous batching admits a fresh request the instant a slot opens, keeping the hardware busy and slashing per-token cost.
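The difference between the two policies can be sketched as a small scheduling simulation. It assumes an idealized setup that is not in the original text: all requests are queued at step 0, every active sequence emits one token per step, prefill cost is ignored, and the output lengths are invented. The function names are hypothetical.

```python
import heapq


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes, then the next batch starts."""
    total = 0
    for i in range(0, len(output_lens), batch_size):
        total += max(output_lens[i:i + batch_size])
    return total


def continuous_batching_steps(output_lens, batch_size):
    """A waiting request takes over a slot the moment any running sequence finishes."""
    if not output_lens:
        return 0
    finish_times = []  # min-heap of the step at which each occupied slot frees up
    for n_tokens in output_lens:
        if len(finish_times) < batch_size:
            start = 0  # a slot is still empty
        else:
            start = heapq.heappop(finish_times)  # earliest slot to free up
        heapq.heappush(finish_times, start + n_tokens)
    return max(finish_times)


# Invented workload: two long requests mixed in with short ones.
output_lens = [100, 10, 10, 10, 100, 10, 10, 10]

static_steps = static_batching_steps(output_lens, batch_size=4)
continuous_steps = continuous_batching_steps(output_lens, batch_size=4)
print("static:", static_steps, "steps")
print("continuous:", continuous_steps, "steps")
```

On this workload the static schedule needs 200 steps and the continuous one 110, because the short requests no longer wait for the long request in their batch. The size of the gain depends entirely on how much the output lengths vary.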

Speculative Decoding

A small, fast draft model proposes several tokens ahead; the large target model then verifies them in a single parallel pass and keeps the matching prefix.^[9] It is lossless — only tokens the target itself would have produced are accepted — and it attacks the sequential bottleneck of decode. But it does not touch prefill, so it leaves time-to-first-token essentially unchanged, and its payoff depends heavily on how predictable the output is: highly structured text accelerates a lot, open-ended generation barely at all.^[10]

FIG. 4 — Speculative decoding. The cheap draft model races ahead with guesses; the expensive target model checks them all at once and commits the run that matches. Several tokens for the price of roughly one target step — when the guesses are good.
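The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is the greedy variant only, where "matching" means the draft token equals the target's own top choice; sampling-based variants use a probabilistic acceptance rule that is not shown. The functions `target_next` and `draft_next` are hypothetical next-token rules, not real models, and the verification loop stands in for what would be one batched forward pass of the target.

```python
def speculative_step(context, draft_next, target_next, k):
    """One round: draft proposes k tokens, target keeps the matching prefix."""
    # Draft model proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(context)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # Target checks every position. A real system scores them in one parallel pass.
    accepted, ctx = [], list(context)
    for token in draft:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # first mismatch: take the target's token, stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # all k matched: the target adds one more token
    return accepted


def generate(context, draft_next, target_next, k, n_tokens):
    """Generate n_tokens and count how many target passes it took."""
    out, target_passes = list(context), 0
    while len(out) - len(context) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        target_passes += 1
    return out[:len(context) + n_tokens], target_passes


def target_next(ctx):
    return (ctx[-1] + 1) % 10  # toy target: count upward modulo 10


def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10  # toy draft: wrong after a 4


tokens, passes = generate([1], draft_next, target_next, k=4, n_tokens=12)
print(tokens[1:], "in", passes, "target passes")
```

The output is identical to what the target would have produced on its own, which is the lossless property, and the number of target passes falls only as far as the draft's guesses are accepted.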

Metrics & the Fundamental Tradeoff

Three numbers govern serving. They split into what the user feels and what the operator pays for.^[8]

TTFT (user-felt): time to first token; perceived responsiveness

TPOT (user-felt): time per output token; how smoothly text streams

TPS (operator): tokens per second; total capacity and cost-efficiency
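As a sketch of how these three numbers could be derived from raw timings, the helpers below take a request's start time and the arrival time of each streamed token. The function names and the timestamps are hypothetical, and TPOT is computed here as the mean gap between tokens after the first, which is one common convention rather than the only one.

```python
def request_latency_metrics(request_start, token_times):
    """Return (TTFT, TPOT) in seconds for one streamed response."""
    ttft = token_times[0] - request_start
    n = len(token_times)
    # Mean gap between consecutive tokens, excluding the wait for the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot


def system_tps(total_output_tokens, wall_clock_seconds):
    """Aggregate tokens per second across all requests served in a time window."""
    return total_output_tokens / wall_clock_seconds


# Made-up timings: the request starts at t=0 and tokens arrive 50 ms apart after 0.5 s.
ttft, tpot = request_latency_metrics(0.0, [0.50, 0.55, 0.60, 0.65, 0.70])
tps = system_tps(total_output_tokens=12000, wall_clock_seconds=10.0)

print(f"TTFT={ttft:.2f}s  TPOT={tpot * 1000:.0f}ms  TPS={tps:.0f}")
```

TTFT and TPOT are measured per request, while TPS is measured across the whole system, which is why a configuration can improve one while hurting the others.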

The hard part is that latency and throughput pull against each other. Batching raises throughput but interleaves heavy prefills with light decodes, regressing both TTFT and TPOT; and past a certain batch size the system goes compute-bound, where doubling the batch only adds latency without buying more throughput.^[7]^[4]

FIG. 5 — The latency–throughput tradeoff. Adding load lifts throughput until it plateaus, but latency keeps climbing — and worsens past the knee. There is no single "fast" setting; you tune toward whichever metric your application lives or dies by. Schematic.
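The plateau can be reproduced with a deliberately simple model of one decode step: the step takes as long as the slower of reading the weights once (shared by the whole batch) or doing the per-sequence compute for every sequence in the batch. Both timing constants are hypothetical, and the model ignores prefill interference, KV-cache reads, and queueing, so it captures only the compute-bound knee described above.

```python
def decode_step_time(batch_size, t_weights_s, t_compute_per_seq_s):
    """Idealized step time: memory-bound until per-sequence compute dominates."""
    return max(t_weights_s, batch_size * t_compute_per_seq_s)


def decode_throughput(batch_size, t_weights_s, t_compute_per_seq_s):
    """Tokens per second when every sequence in the batch emits one token per step."""
    return batch_size / decode_step_time(batch_size, t_weights_s, t_compute_per_seq_s)


# Hypothetical constants: 20 ms to stream the weights, 0.5 ms of compute per sequence.
T_WEIGHTS, T_COMPUTE = 0.020, 0.0005

for batch_size in [1, 8, 32, 64, 128]:
    step = decode_step_time(batch_size, T_WEIGHTS, T_COMPUTE)
    tps = decode_throughput(batch_size, T_WEIGHTS, T_COMPUTE)
    print(f"batch={batch_size:>3}  step={step * 1000:5.1f} ms  throughput={tps:7.0f} tok/s")
```

With these constants the knee sits at a batch size of 40. Below it, larger batches add throughput at no cost in step time; above it, going from 64 to 128 doubles the per-token latency while throughput stays flat.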

How It All Fits Together

A single request walks a fixed path: the prompt is read in a prefill pass, its keys and values land in the KV cache, then the decode loop emits tokens one at a time, reusing and extending that cache until it stops.

Prompt (tokens arrive) → Phase 1: Prefill (parallel pass · TTFT) → Store: KV cache (keys/values held) → Phase 2: Decode loop (one token at a time · TPOT) → Out: Tokens (streamed back)

FIG. 6 — The request lifecycle. Quantization shrinks the weights this path moves, the KV cache is what decode reuses, batching overlaps many of these paths on one GPU, and speculative decoding shortcuts the decode loop. Every lever in this note plugs into one of these stages.

The throughline: prefill sets how quickly you answer, decode sets how quickly you stream, the KV cache sets your memory ceiling, and batching sets how many users share the GPU — and every optimization is a move on one of those, paid for somewhere else. This discipline sits beneath the whole roadmap: each RAG retrieval-then-generate, each step of an agent's loop, and each run of an evaluation suite is one or more inference calls, so the cost and latency here set the economics of everything built on top. It also closes a loop with adaptation — a distilled or quantized smaller model is precisely the lever that makes serving cheap, which is why "make it smaller" and "serve it efficiently" are two halves of the same cost story.

① Every optimization is a tradeoff, and only your workload settles it. Quantization can erode quality; batching lifts throughput but can hurt any one user's latency; speculative decoding speeds decode but adds nothing to TTFT and only pays off on predictable output. Benchmark on your own traffic, not a vendor's demo.

② Optimize the metric that matches the application. Chasing throughput for an interactive chatbot, or tight latency for an overnight batch job, is optimizing the wrong thing. Decide whether you are latency-bound or throughput-bound first, then tune toward it.
