Context & Long-Range Modeling: Why Models Have a Memory Limit

TL;DR — Scope

This note is the series capstone: why a model can only “see” a bounded stretch of text at once, and why enlarging that window is so costly. The context window is the token budget the model attends to; its practical size is limited by the cost of attention, which grows quadratically with length. The note covers the quadratic bottleneck, the KV cache, how position is encoded (RoPE) and why long inputs strain it, the “lost in the middle” effect, and the efficiency techniques (FlashAttention, sparse / sliding-window attention, grouped-query attention) that push the window outward. It builds on the transformer (Foundation 02) and on inference (the serving discipline), and it leaves out retrieval, the systems-level answer to a finite window.

The Problem

Foundation 02 gave us attention's superpower: every token can see every other token. But that power has a brutal price. With N tokens, the model computes interactions for every pair, so the cost grows with N² — double the context and you roughly quadruple the work.^[4] That is why every model ships with a context window, a hard cap on how many tokens it can attend to at once; exceed it and the model literally cannot see the rest.^[1]

And even inside the window two more problems lurk: storing every token's key/value vectors consumes memory that grows with length,^[10] and the model's grip on position weakens with distance, so it tends to lose track of the middle.^[3] Long-range modeling is the work of pushing that boundary outward without the cost exploding.

FIG. 1 — The window. The model attends only to a bounded span of tokens at once; anything outside the window is invisible to it, no matter how relevant. Schematic.

The Concepts

The Context Window

A model can only attend to a fixed maximum number of tokens at once, its context window, which is set before training and tied directly to the attention mechanism.^[1] Early windows were small (LLaMA started at 2,048 tokens); modern ones reach 128K and beyond.^[8]^[2] Anything past the cap is simply invisible.

The Quadratic Bottleneck

In standard attention, time and memory both scale quadratically with sequence length.^[6] For long inputs this quickly dominates everything else and becomes the primary bottleneck, which is exactly why the window is capped rather than infinite.^[5]

FIG. 2 — Why context is expensive. Most costs grow linearly with length, but attention grows with the square — so each doubling of context roughly quadruples the attention compute. Schematic.
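The quadrupling can be checked with simple counting. The sketch below is illustrative only: `attention_pairs` is a hypothetical helper, and it counts score entries for a single head in a single layer rather than measuring real runtime.

```python
def attention_pairs(n_tokens: int, causal: bool = True) -> int:
    """Count query-key score entries for one attention head in one layer."""
    if causal:
        # Each token attends to itself and every earlier token: 1 + 2 + ... + N.
        return n_tokens * (n_tokens + 1) // 2
    # Bidirectional attention scores every ordered pair.
    return n_tokens * n_tokens


for n in (1_000, 2_000, 4_000, 8_000):
    pairs = attention_pairs(n, causal=False)
    print(f"{n:>6} tokens -> {pairs:>12,} score entries")
```

Each row of the printout has four times the entries of the row before it, which is the N² growth described above. The causal variant is about half as large but grows at the same rate.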

The KV Cache

To avoid recomputing the past at every step, the model caches the key and value vectors of all previous tokens — but that cache grows linearly with context length and can become the dominant memory cost, sometimes exceeding the size of the model's own weights.^[10]^[5]

FIG. 3 — The growing cache. Every token leaves a key/value entry behind, so the cache climbs steadily with length and, in long contexts, can overtake the model itself as the memory bottleneck. Schematic.
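A back-of-envelope estimate shows how the cache can overtake the weights. The configuration below (32 layers, 32 heads of dimension 128, 16-bit values) is an assumed, illustrative shape for a model of roughly 7B parameters, not a figure taken from the references; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size: one key and one value vector per token, head, and layer."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3

# Assumed configuration: every attention head keeps its own keys and values.
full = kv_cache_bytes(seq_len=32_768, n_layers=32, n_kv_heads=32, head_dim=128)
# Same model shape, but only 8 key/value heads shared across the query heads.
shared = kv_cache_bytes(seq_len=32_768, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"{full / GIB:.1f} GiB")    # 16.0 GiB
print(f"{shared / GIB:.1f} GiB")  # 4.0 GiB
```

Under these assumptions a single 32K-token sequence needs 16 GiB of cache, more than the roughly 13 GiB that 7 billion 16-bit weights occupy. Cutting the number of key/value heads from 32 to 8 shrinks the cache fourfold, which is the lever grouped-query attention pulls.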

Position & RoPE

Attention itself is order-blind, so position must be encoded. The modern standard is Rotary Position Embedding (RoPE), which encodes position by rotating query and key vectors so that their dot product depends on the relative offset between tokens.^[1] The catch: a model trained with RoPE up to a fixed length degrades on longer, unseen positions, where the rotation angles fall outside the range seen in training and distant tokens lose definition.^[2]
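A minimal pure-Python sketch of the rotation, assuming the common formulation in which consecutive pairs of dimensions are rotated at frequencies `base ** (-2i/d)` with `base = 10000`. The vectors `q` and `k` are made-up values, and real implementations operate on batched tensors rather than lists.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Pair i/2 rotates at frequency base ** (-i / d): early pairs spin fast, late pairs slowly.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]   # made-up query vector
k = [1.0, 0.4, -0.6, 0.9]   # made-up key vector

near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
far = dot(rope(q, 105), rope(k, 102))    # positions 105 and 102, offset 3
print(near, far)
```

The two scores agree up to floating-point error because only the offset between the positions survives the dot product. That is what "relative position" means here, and it is also why positions beyond the trained range produce angles the model has never had to interpret.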

Lost in the Middle

Even within the window, models do not use all positions equally: they disproportionately attend to the beginning and end of the context and underweight the middle, so a crucial fact buried halfway through a long input can effectively vanish.^[3]

FIG. 4 — Lost in the middle. Information at the edges of a long context is recalled well; information buried in the middle is the most likely to be overlooked. Schematic.
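One way to observe the effect is to plant a known fact at different depths of a long filler text and ask the model to retrieve it. The sketch below only builds the probe prompts; the filler sentences, the planted fact, and the helper `build_probe` are all hypothetical, and sending the prompts to a model is deliberately left out.

```python
def build_probe(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(sentences)


# Made-up filler and fact, for illustration only.
filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The access code is 7421."

probes = {
    depth: build_probe(filler, needle, depth)
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
}
```

Asking the same question ("What is the access code?") against each probe and plotting accuracy by depth would give a curve like the one in FIG. 4, if the model under test shows the effect.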

Stretching the Window

Two fronts.

Make attention cheaper: FlashAttention computes the exact same result with far fewer memory accesses, enabling longer contexts;^[6] sparse and sliding-window schemes (Longformer, BigBird) attend mostly locally with a little global reach, achieving near-linear cost;^[7] and grouped-query attention shrinks the KV cache by sharing keys and values across heads.^[7]

Extend position: rather than naively extrapolating RoPE, interpolate or rescale the position indices (Position Interpolation, YaRN), sometimes with a little extra training. This is the “train short, test long” recipe that took Llama-3 from 8K to 128K.^[8]^[9]
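The core idea of Position Interpolation can be sketched in a few lines: squeeze the longer range of positions back into the range seen in training before computing the rotation angles. This is a simplified illustration under assumed lengths (8,192 trained, 131,072 target); the helper names are hypothetical, and YaRN's frequency-dependent rescaling is not shown.

```python
def rope_angles(pos, dim, base=10000.0):
    """Rotation angle for each dimension pair at a given (possibly fractional) position."""
    return [pos * base ** (-i / dim) for i in range(0, dim, 2)]


def interpolated_position(pos, train_len, target_len):
    """Linearly map positions in [0, target_len) into [0, train_len)."""
    scale = min(1.0, train_len / target_len)
    return pos * scale


train_len, target_len = 8_192, 131_072   # assumed lengths for illustration
far = 100_000                            # a position well past the trained range

scaled = interpolated_position(far, train_len, target_len)
print(scaled)  # 6250.0

# After scaling, every angle stays within what position train_len - 1 would produce.
within_range = all(
    a <= b for a, b in zip(rope_angles(scaled, 64), rope_angles(train_len - 1, 64))
)
print(within_range)  # True
```

The trade-off is resolution: neighbouring tokens now sit a sixteenth of a position apart, which is one reason a short round of extra training usually accompanies the change.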

FIG. 5 — Trimming the interactions. Full attention connects every token to all earlier ones (the quadratic cost); sliding-window attention keeps only a local band, trading some global reach for near-linear cost. Schematic.
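The band in FIG. 5 can be written as a boolean mask. The sketch below is a toy illustration with hypothetical helper names: it counts how many query-key pairs a causal sliding window keeps compared with full causal attention. A mask alone saves no compute; real kernels get the speedup by skipping the masked-out blocks entirely.

```python
def sliding_window_mask(n_tokens, window):
    """mask[i][j] is True when query i may attend to key j (causal, within the window)."""
    return [
        [(j <= i) and (i - j < window) for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


def count_allowed(mask):
    """Number of query-key pairs the mask keeps."""
    return sum(sum(row) for row in mask)


n, w = 16, 4
full = count_allowed(sliding_window_mask(n, window=n))   # window = n is plain causal attention
local = count_allowed(sliding_window_mask(n, window=w))

print(full)   # 136
print(local)  # 58
```

With a fixed window the kept pairs grow roughly as N × w instead of N², which is where the near-linear cost comes from. The price is that a token can no longer look directly at anything further back than w positions.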

How It All Fits Together

Context is the budget every model works within. A long input meets the window cap; attention pays the quadratic price and the KV cache pays a linear one; efficiency and position techniques push the boundary out as far as the hardware and the math allow.

Long input (many tokens) → Cap: Context window (the hard limit) → Cost: Attention N² + KV cache (compute & memory) → Stretch: Efficiency + position (FlashAttn · sparse · RoPE) → Out: Usable long context (further, not free)

FIG. 6 — The long-context pipeline. The window sets the limit; attention and the KV cache set the cost; efficiency and position techniques extend the reach.

The throughline: this note closes the loop opened in Foundation 02. Attention's any-to-any reach is the very thing that makes context expensive — so this is the bill for that power. It is why the KV cache dominates serving (the Inference discipline), why long prompts cost more, and it is the foundational reason retrieval exists: when you cannot fit everything in the window, you fetch only what matters.

End of the Foundations series

Six foundations, one chain

The substrate beneath every discipline: text becomes vectors, attention mixes them, training sets the weights, scale decides how far that goes, decoding turns probabilities back into words, and context is the window all of it runs within.

01 Tokenization & Embeddings: text → numbers that carry meaning.
02 The Transformer & Attention: every token mixes in every other.
03 Pretraining & Next-Token Prediction: the objective that builds the weights.
04 Scaling Laws & Emergence: how far scale carries it.
05 Sampling & Decoding: probabilities back into text.
06 Context & Long-Range Modeling: the window it all operates within.

① A big advertised window is not the same as using it well. A model rated for 128K or a million tokens can still lose the middle, blur distant facts, and slow down sharply. “Fits in the window” is not “reliably used.”

② Longer is not free, and often not better. Each added token lengthens the KV cache linearly and pushes total attention compute up quadratically, raising latency and cost, and it can dilute the signal. A focused, retrieved context often beats stuffing everything in.

References

[1] Puter Encyclopedia. Context Window. https://developer.puter.com/encyclopedia/context-window/
[2] Medium · R. Bhandari. When More Becomes Less: Why LLMs Hallucinate in Long Contexts. https://medium.com/design-bootcamp/when-more-becomes-less-why-llms-hallucinate-in-long-contexts-fc903be6f025
[3] arXiv · Wang et al. Layer-Specific Scaling of Positional Encodings for Long-Context Modeling. https://arxiv.org/pdf/2503.04355
[4] arXiv · Attention Survey. Hardware-efficient, Sparse, Compact, and Linear Attention. https://attention-survey.github.io/files/Attention_Survey.pdf
[5] arXiv · Huang et al. Write-Gated KV for Efficient Long-Context Inference. https://arxiv.org/pdf/2512.17452
[6] arXiv · Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention. https://arxiv.org/pdf/2205.14135
[7] arXiv · Stream. Sparse Attention for Long Context (FlashAttention, sliding window, GQA). https://arxiv.org/html/2510.19875
[8] arXiv · Chen et al. Extending Context Window via Positional Interpolation. https://arxiv.org/pdf/2306.15595
[9] arXiv · DPE. Effective Length Extrapolation via Dimension-Wise Positional Embeddings. https://arxiv.org/pdf/2504.18857
[10] arXiv · HCAttention. Extreme KV Cache Compression for Long-Context LLMs. https://arxiv.org/pdf/2507.19823