This note covers the retrieval and context discipline: how to put the right external information in front of an LLM at inference time, and how to manage its limited working memory so it answers from facts rather than guesses. It spans Retrieval-Augmented Generation (RAG) — ingestion, embeddings, vector search, chunking, hybrid retrieval, reranking — and the broader practice of context engineering: curating the entire context window. It deliberately excludes how the model's weights are trained (that is adaptation) and how its outputs are scored (that is evaluation), though it connects tightly to both.
A raw LLM has two hard limits. First, its knowledge is frozen at training time — it cannot see your private documents, and it cannot know anything that happened after its cutoff. RAG addresses this by combining retrieval with generation: an embedding model converts your data into vectors, semantic search finds the passages most similar to a question, and those passages are handed to the model as grounding.[1]
Second, the model has a finite context window — a limited working memory. The naive instinct is to fill it: dump everything in. But quality degrades as the window fills, so the real task is to supply the minimal set of high-signal information, well-ordered, rather than everything available. This is why practitioners now say most agent failures are context failures, not model failures[8] — and why curating the window has become arguably the central job in building reliable systems.
An embedding model turns text into a numerical vector that captures its meaning, positioning it in a high-dimensional space where similar ideas sit close together.[1] At query time the question is embedded with the same model, and the system retrieves the nearest passages by similarity (commonly cosine similarity). This is "semantic" search: it matches on meaning, not exact keywords.
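To make the nearest-passage lookup concrete, here is a minimal sketch of cosine-similarity search in plain Python. The three-dimensional vectors and the passage ids are invented for illustration; a real embedding model would produce the vectors, and a vector index would replace the linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the ids of the k passages most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), pid) for pid, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [pid for _, pid in scored[:k]]

# Hypothetical toy "embeddings"; real models emit hundreds or thousands of dimensions.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "return-window": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stand-in for an embedded question about refunds
nearest = top_k(query, index, k=2)
```

With these toy vectors, the two refund-related passages outrank the shipping passage because their vectors point in nearly the same direction as the query.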
Documents are too large to retrieve whole, so they are split into chunks that get embedded and indexed independently. This is the single most failure-prone decision in the stack: by one estimate, roughly 80% of RAG failures trace back to the ingestion and chunking layer rather than the model.[3] Naive fixed-size splitting is destructive: it breaks paragraphs mid-thought and separates a question from its answer.[7] Chunk size is a genuine tradeoff, typically landing between roughly 100 and 600 tokens: larger chunks carry more context but blunt retrieval precision, while smaller ones retrieve precisely but lose surrounding meaning.[4] A strong fix is contextual retrieval, which prepends document-level context to each chunk so that it keeps its relationship to the whole.[4]
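As an illustration of splitting on natural boundaries instead of fixed offsets, the sketch below packs whole paragraphs into chunks under a size budget and then prefixes each chunk with document-level context. It rests on several assumptions: whitespace-separated words stand in for tokens, blank lines mark paragraph boundaries, and a plain title prefix stands in for the richer, per-chunk context a contextual-retrieval setup would generate. The function names are hypothetical.

```python
def chunk_by_paragraph(text, max_words=120):
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    Words are a rough proxy for tokens. A single paragraph longer than
    max_words becomes its own oversized chunk rather than being cut mid-thought.
    """
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def add_context(chunks, doc_title):
    """Prefix each chunk with document-level context before it is embedded."""
    return [f"[From: {doc_title}]\n{chunk}" for chunk in chunks]

# Tiny example: three paragraphs, budget of six words per chunk.
sample = "a b c\n\nd e f\n\ng h"
chunks = add_context(chunk_by_paragraph(sample, max_words=6), "Sample Doc")
```

The budget here is deliberately tiny so the behavior is easy to follow; in practice it would sit somewhere in the 100 to 600 token range discussed above and would be tuned against retrieval metrics.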
Pure vector search has a weakness: high similarity is not the same as relevance — "how do I reset my password" and "password reset policy" look close but serve different intents.[7] Production systems therefore use two stages. First, hybrid retrieval combines dense vector search with sparse keyword search (BM25) to maximize recall — casting a wide net. Then a reranker (a cross-encoder) re-scores those candidates for true relevance to the question and keeps only the best.[2] Broad recall first, sharp precision second.
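The two stages can be sketched as follows. The text above does not say how the dense and sparse result lists are merged; reciprocal rank fusion (RRF) is used here as one common choice, which is an assumption rather than the method the sources prescribe. The `word_overlap` scorer is a toy stand-in for a cross-encoder, and all document ids are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores 1 / (k + rank) per list it appears in; k=60 is a
    conventional default. Ties are broken alphabetically for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

def word_overlap(query, passage):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def rerank(query, candidates, score_fn=word_overlap, keep=3):
    """Re-score recall-stage candidates and keep only the best few."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:keep]

# Stage 1: wide net. Hypothetical result lists from vector search and BM25.
dense = ["d2", "d1", "d4"]
sparse = ["d1", "d3", "d2"]
fused = reciprocal_rank_fusion([dense, sparse])

# Stage 2: sharp precision over the candidate passages.
passages = ["password policy document", "how to reset your password", "billing faq"]
best = rerank("reset password", passages, keep=1)
```

Documents that appear in both lists rise to the top of the fused ranking, which is the point of the recall stage; the reranker then looks at query and passage together, something a real cross-encoder does far better than word overlap.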
RAG is one source of context. Context engineering is the broader discipline of curating the entire window — system instructions, history, retrieved passages, tool outputs, and output format — to give the model the right information, in the right format, at the right time.[5] For agents this is dynamic: new context floods in from every tool call, so it must be actively managed rather than written once.[6] The four working strategies:
The mantra: a context window is not a static string but a dynamic system that runs before every model call.[5]
Bigger context windows did not end the problem. Retrieval quality degrades as context length grows even for large-window models, and shorter, more precise context often beats dumping 50K tokens of retrieved text.[3] Models also miss information placed mid-sequence — the "lost in the middle" effect — so practitioners keep only the top 3–5 passages and place the strongest evidence at the very start and end.[4] One study found accuracy dropping around 32K tokens, far below advertised million-token limits.[5]
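One way to apply the "strongest evidence at the start and end" advice is to trim to the top few passages and then alternate them between the front and the back of the context, so the weakest of the kept passages lands in the middle. This is a minimal sketch; the function name and the alternating scheme are illustrative assumptions, not a prescribed algorithm.

```python
def order_for_edges(passages_best_first, keep=5):
    """Keep the top passages and push the strongest toward both ends.

    Input is ranked best first. Even positions fill the front, odd positions
    fill the back (reversed), so rank 1 opens the context and rank 2 closes it.
    """
    top = passages_best_first[:keep]
    front, back = [], []
    for i, passage in enumerate(top):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

ranked = ["p1", "p2", "p3", "p4", "p5", "p6"]  # hypothetical reranker output
ordered = order_for_edges(ranked, keep=5)
```

Here `p6` is dropped entirely, `p1` opens the context, `p2` closes it, and `p5` sits in the middle where it is most likely to be overlooked.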
When context is mismanaged, it fails in four distinct ways — each needing a different fix.[5]
A wrong or hallucinated fact enters the context and gets treated as ground truth thereafter.
So much context accumulates that the model loses focus on the actual task.
Irrelevant-but-plausible content pulls the answer off course.
Retrieved passages contradict each other, and the model can't reconcile them.
The pieces run as a two-clock pipeline. Index time happens offline, once per document rather than per request: documents are chunked, embedded, and stored in a vector index. Query time happens live, per request: the query is rewritten and used to retrieve candidates from that index with hybrid search, the candidates are reranked and assembled into a tight context, and only then does the model generate an answer from it.
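The two clocks can be sketched as two functions that share nothing but the index. Every stage is passed in as a callable because the concrete chunker, embedder, retriever, reranker, and generator are deployment choices; all parameter names below are hypothetical, and passing the original (not rewritten) query to the generator is a design assumption.

```python
def build_index(documents, chunk_fn, embed_fn):
    """Index time (offline): chunk every document, embed each chunk, store both."""
    index = []
    for doc in documents:
        for chunk in chunk_fn(doc):
            index.append((embed_fn(chunk), chunk))
    return index

def answer(query, index, rewrite_fn, retrieve_fn, rerank_fn, generate_fn, keep=3):
    """Query time (per request): rewrite, retrieve, rerank, assemble, generate."""
    rewritten = rewrite_fn(query)                   # e.g. expand or clean up the query
    candidates = retrieve_fn(rewritten, index)      # wide net: hybrid retrieval
    best = rerank_fn(rewritten, candidates)[:keep]  # precision: keep the top few
    context = "\n\n".join(best)                     # tight, ordered context
    return generate_fn(query, context)              # the model sees only this window

# Toy stand-ins so the sketch runs end to end.
index = build_index(
    ["alpha beta. gamma delta"],
    chunk_fn=lambda doc: doc.split(". "),
    embed_fn=len,  # placeholder "embedding"
)
result = answer(
    "Q",
    index,
    rewrite_fn=str.lower,
    retrieve_fn=lambda q, idx: [chunk for _, chunk in idx],
    rerank_fn=lambda q, cands: sorted(cands),
    generate_fn=lambda q, ctx: f"{q}|{ctx}",
    keep=1,
)
```

Keeping the stages injectable also makes each hop individually measurable, since any one of them can be swapped out while the rest of the pipeline stays fixed.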
The throughline: retrieval decides what the model could know, context engineering decides what it actually sees, and the generator can only be as good as that assembled window. RAG supplies the facts; context engineering rations the attention. And the whole pipeline only improves if it is measured — which is exactly where this discipline hands off to evaluation: the "RAG triad" of faithfulness, answer relevance, and context relevance is how you score each hop and gate changes to chunking, embeddings, or retrieval before they ship.
Bigger context windows are not the fix. The "just dump everything in" instinct fails because of context rot — accuracy can fall well before a model's advertised token limit, so curation beats capacity.
The model is usually not your bottleneck — retrieval is. Teams routinely spend weeks swapping models and tuning prompts while their retrieval quietly returns the wrong context every few queries. Measure the retrieval layer first.