This note is the series capstone: why a model can only “see” a bounded stretch of text at once, and why enlarging that window is so costly. The context window is the token budget the model attends to; its size is limited by the cost of attention, which grows quadratically with length. It covers the quadratic bottleneck, the KV cache, how position is encoded (RoPE) and why long inputs strain it, the “lost in the middle” effect, and the efficiency techniques (FlashAttention, sparse / sliding-window attention, grouped-query attention) that push the window outward. It builds on the transformer (Foundation 02) and inference (the serving discipline), and excludes retrieval, the systems-level answer to a finite window.
Foundation 02 gave us attention's superpower: every token can see every other token. But that power has a brutal price. With N tokens, the model computes interactions for every pair, so the cost grows with N² — double the context and you roughly quadruple the work.[4] That is why every model ships with a context window, a hard cap on how many tokens it can attend to at once; exceed it and the model literally cannot see the rest.[1]
And even inside the window two more problems lurk: storing every token's key/value vectors consumes memory that grows with length,[10] and the model's grip on position weakens with distance, so it tends to lose track of the middle.[3] Long-range modeling is the work of pushing that boundary outward without the cost exploding.
A model can only attend to a fixed maximum number of tokens at once — its context window — which is set before training and tied directly to the attention mechanism.[1] Early models were small (LLaMA started at 2,048 tokens); modern ones reach 128K and beyond.[8][2] Anything past the cap is simply invisible.
Standard attention's time and memory both scale quadratically with sequence length, because every token is scored against every other token.[6] For long inputs this cost quickly dominates everything else and becomes the primary bottleneck, which is exactly why the window is capped rather than infinite.[5]
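To make the quadratic growth concrete, here is a minimal sketch that counts query-key pairs and the memory needed to hold one attention head's score matrix. It assumes a naive implementation that materializes the full N x N matrix in 16-bit floats; the function names are illustrative, not taken from any library.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over n_tokens."""
    return n_tokens * n_tokens


def score_matrix_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes to hold one head's N x N score matrix (fp16 assumed by default)."""
    return attention_score_entries(n_tokens) * bytes_per_value


for n in (2_048, 4_096, 8_192):
    pairs = attention_score_entries(n)
    mib = score_matrix_bytes(n) / 2**20
    print(f"{n:>6} tokens -> {pairs:>12,} pairs, {mib:,.0f} MiB per head")
```

Each doubling of the length multiplies both the pair count and the matrix size by four: 8 MiB per head at 2,048 tokens becomes 32 MiB at 4,096 and 128 MiB at 8,192, before multiplying by the number of heads and layers.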
To avoid recomputing the past at every step, the model caches the key and value vectors of all previous tokens — but that cache grows linearly with context length and can become the dominant memory cost, sometimes exceeding the size of the model's own weights.[10][5]
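A back-of-the-envelope estimate shows how quickly the cache grows. The sketch below assumes one key and one value vector per token, per layer, per key/value head, stored in 16-bit floats. The configuration (32 layers, 32 key/value heads, head dimension 128) is a hypothetical example chosen for illustration, not a claim about any specific model.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Approximate KV-cache size for a single sequence."""
    # Factor of 2: one key vector plus one value vector per token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical configuration, for illustration only.
config = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for n in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(n, **config) / 2**30
    print(f"{n:>7} tokens -> {gib:.0f} GiB of KV cache")
```

Under these assumptions each token costs 0.5 MiB, so a 4,096-token sequence needs 2 GiB and a 131,072-token sequence needs 64 GiB. Lowering `n_kv_heads` reduces the total proportionally.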
Attention itself is order-blind, so position must be encoded. The modern standard is Rotary Position Embedding (RoPE), which rotates each query and key vector by an angle proportional to its position, so that the attention score between two tokens depends on their relative offset.[1] The catch: RoPE is trained on a fixed length and degrades on longer, unseen positions, where its high-frequency components blur together and distant tokens lose definition.[2]
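The rotation idea can be sketched in a few lines. The toy version below treats each consecutive pair of vector components as a complex number and rotates it by `pos * theta`, with one frequency per pair. It assumes an even vector length and the commonly used base of 10000; the vectors and helper names are made up for illustration, and real implementations operate on batched tensors.

```python
import cmath


def rope(vec, pos, base=10000.0):
    """Rotate consecutive component pairs of vec by position-dependent angles."""
    d = len(vec)  # assumed even
    out = []
    for k in range(d // 2):
        theta = base ** (-2 * k / d)  # per-pair rotation frequency
        z = complex(vec[2 * k], vec[2 * k + 1]) * cmath.exp(1j * pos * theta)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]  # toy query vector
k = [1.0, 0.4, -0.7, 0.2]  # toy key vector

near = dot(rope(q, 3), rope(k, 1))      # positions 3 and 1, offset 2
far = dot(rope(q, 103), rope(k, 101))   # positions 103 and 101, offset 2
print(abs(near - far) < 1e-9)
```

The two scores agree up to floating-point error because only the offset between the positions enters the dot product, which is the sense in which RoPE encodes relative position.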
Even within the window, models do not use all positions equally: they disproportionately attend to the beginning and end of the context and underweight the middle, so a crucial fact buried halfway through a long input can effectively vanish.[3]
Two fronts. Make attention cheaper: FlashAttention computes the exact same result with far fewer memory accesses, enabling longer contexts;[6] sparse and sliding-window schemes (Longformer, BigBird) attend mostly locally with a little global reach, achieving near-linear cost;[7] and grouped-query attention shrinks the KV cache by sharing keys and values across heads.[7] Extend position: rather than naively extrapolating RoPE, interpolate or rescale the position indices (Position Interpolation, YaRN), sometimes with a little extra training — the “train short, test long” recipe that took Llama-3 from 8K to 128K.[8][9]
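Two of these ideas reduce to simple arithmetic. The sketch below compares the number of attended pairs under full causal attention with a sliding window, and shows the linear rescaling at the heart of Position Interpolation. It is a simplification under stated assumptions: the window here has no global tokens, the interpolation omits any fine-tuning and the frequency-dependent refinements of YaRN, and all function names are illustrative.

```python
def causal_pairs(n_tokens: int) -> int:
    """Pairs attended under full causal attention (each token sees itself and its past)."""
    return n_tokens * (n_tokens + 1) // 2


def sliding_window_pairs(n_tokens: int, window: int) -> int:
    """Pairs attended when each token sees itself and at most window - 1 predecessors."""
    return sum(min(i + 1, window) for i in range(n_tokens))


def interpolate_positions(n_tokens: int, trained_len: int) -> list[float]:
    """Linearly squeeze position indices into the range seen during training."""
    scale = min(1.0, trained_len / n_tokens)
    return [i * scale for i in range(n_tokens)]


n = 8_192
print(causal_pairs(n), sliding_window_pairs(n, window=512))

# Eight tokens mapped into a range trained on four positions.
print(interpolate_positions(8, trained_len=4))
```

With a fixed window the pair count grows linearly in the sequence length rather than quadratically. The interpolated positions stay below the trained length at the price of fractional, more closely spaced indices, which is why a little extra training is sometimes used to adapt the model.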
Context is the budget every model works within. A long input meets the window cap; attention pays the quadratic price and the KV cache pays a linear one; efficiency and position techniques push the boundary out as far as the hardware and the math allow.
The throughline: this note closes the loop opened in Foundation 02. Attention's any-to-any reach is the very thing that makes context expensive — so this is the bill for that power. It is why the KV cache dominates serving (the Inference discipline), why long prompts cost more, and it is the foundational reason retrieval exists: when you cannot fit everything in the window, you fetch only what matters.
The substrate beneath every discipline: text becomes vectors, attention mixes them, training sets the weights, scale decides how far that goes, decoding turns probabilities back into words, and context is the window all of it runs within.
A big advertised window is not the same as using it well. A model rated for 128K or a million tokens can still lose the middle, blur distant facts, and slow down sharply — “fits in the window” is not “reliably used.”
Longer is not free, and often not better. Attention compute grows quadratically with total length and KV-cache memory grows linearly, so every extra token raises latency and cost, and can dilute the signal. A focused, retrieved context often beats stuffing everything in.