Research Note  ·  Applied ML / Retrieval

Retrieval & Context Engineering for AI Systems: RAG, Grounding, and the Context Window

Roadmap · Discipline 02 of the LLM Systems series
TL;DR — Scope

This note covers the retrieval and context discipline: how to put the right external information in front of an LLM at inference time, and how to manage its limited working memory so it answers from facts rather than guesses. It spans Retrieval-Augmented Generation (RAG) — ingestion, embeddings, vector search, chunking, hybrid retrieval, reranking — and the broader practice of context engineering: curating the entire context window. It deliberately excludes how the model's weights are trained (that is adaptation) and how its outputs are scored (that is evaluation), though it connects tightly to both.

01

The Problem

A raw LLM has two hard limits. First, its knowledge is frozen at training time — it cannot see your private documents, and it cannot know anything that happened after its cutoff. RAG addresses this by combining retrieval with generation: an embedding model converts your data into vectors, semantic search finds the passages most similar to a question, and those passages are handed to the model as grounding.[1]

Second, the model has a finite context window — a limited working memory. The naive instinct is to fill it: dump everything in. But quality degrades as the window fills, so the real task is to supply the minimal set of high-signal information, well-ordered, rather than everything available. This is why practitioners now say most agent failures are context failures, not model failures[8] — and why curating the window has become arguably the central job in building reliable systems.

Phase 01 · Retrieve: Find passages relevant to the query
Phase 02 · Augment: Inject them into the prompt as context
Phase 03 · Generate: Answer grounded in that context
FIG. 1 — The RAG loop in three moves. Instead of relying on what the model memorized, you fetch facts at query time and ground the answer in them.
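
The augment step of the loop above can be sketched as plain string assembly. This is a minimal illustration rather than a prescribed format: the helper name `build_grounded_prompt`, the numbered-passage layout, and the instruction wording are all assumptions made for the example.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Augment step: inject retrieved passages into the prompt as numbered context.

    Hypothetical helper; the layout and wording are illustrative assumptions.
    """
    if not passages:
        context = "(no passages retrieved)"
    else:
        context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days of purchase.", "Shipping takes 3-5 days."],
)
print(prompt)
```

Numbering the passages gives the generator something to cite, which makes it easier to check afterwards whether an answer really came from the supplied context.
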
02

The Concepts

Embeddings & Vector Search

An embedding model turns text into a numerical vector that captures its meaning, positioning it in a high-dimensional space where similar ideas sit close together.[1] At query time the question is embedded the same way, and the system retrieves the nearest passages by similarity (commonly cosine similarity). This is "semantic" search: it matches on meaning, not exact keywords.

[Figure 2 diagram: 2-D projection of the semantic space, showing the query, its retrieved top-k neighborhood of relevant chunks, and unrelated chunks farther away.]
FIG. 2 — Why semantic search works. Meaning becomes geometry: the query lands near passages about the same thing, and "retrieval" is simply grabbing its nearest neighbors. Illustrative 2-D sketch of a high-dimensional space.
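
The nearest-neighbor idea can be made concrete with a brute-force search. The sketch below is only illustrative: the three-dimensional vectors and chunk IDs are invented, and a real system would get its vectors from an embedding model and would typically use an approximate nearest-neighbor index instead of scoring every chunk.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Brute-force nearest neighbors: score every chunk, keep the k most similar."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy 3-D "embeddings", invented for illustration only.
index = {
    "reset-password": [0.9, 0.1, 0.0],
    "change-email": [0.7, 0.3, 0.1],
    "holiday-schedule": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
print(top_k(query, index, k=2))  # ['reset-password', 'change-email']
```
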

Chunking — the unglamorous bottleneck

Documents are too large to retrieve whole, so they are split into chunks that get embedded and indexed independently. This is the single most failure-prone decision in the stack: by one estimate, roughly 80% of RAG failures trace back to the ingestion and chunking layer rather than the model.[3] Naive fixed-size splitting is destructive: it breaks paragraphs mid-thought and separates a question from its answer.[7] Chunk size is a genuine tradeoff, typically landing between roughly 100 and 600 tokens: larger chunks carry more context but blunt retrieval precision, while smaller ones retrieve precisely but lose surrounding meaning.[4] A strong fix is contextual retrieval, which prepends document-level context to each chunk so it keeps its relationship to the whole.[4]
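
One way to avoid cutting mid-thought is to pack whole paragraphs into chunks under a size budget. The sketch below makes several simplifying assumptions: whitespace-separated words stand in for tokens, the function names are hypothetical, and the "document-level context" is just a title string, which is far cruder than the contextual retrieval described in [4].

```python
def chunk_by_paragraph(text: str, max_words: int = 120) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    Words stand in for tokens here (an assumption). A single paragraph that
    exceeds the budget is kept whole rather than cut mid-thought.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def add_document_context(chunks: list[str], doc_title: str) -> list[str]:
    """Prepend a document-level label to each chunk before embedding."""
    return [f"Document: {doc_title}\n{chunk}" for chunk in chunks]


doc = (
    "Q: How do I reset my password?\n\n"
    "A: Open Settings and choose Reset.\n\n"
    "Billing runs on the first of each month."
)
chunks = chunk_by_paragraph(doc, max_words=14)
labeled = add_document_context(chunks, "Help Center FAQ")
print(len(chunks))  # 2
```

With this budget the question and its answer land in the same chunk, and the unrelated billing paragraph starts a new one.
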

Hybrid Search & Reranking

Pure vector search has a weakness: high similarity is not the same as relevance — "how do I reset my password" and "password reset policy" look close but serve different intents.[7] Production systems therefore use two stages. First, hybrid retrieval combines dense vector search with sparse keyword search (BM25) to maximize recall — casting a wide net. Then a reranker (a cross-encoder) re-scores those candidates for true relevance to the question and keeps only the best.[2] Broad recall first, sharp precision second.
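
The text does not say how the dense and sparse result lists are combined. One common choice is reciprocal rank fusion (RRF), shown here as an assumption rather than as the method the cited sources use. The document IDs and both input rankings are invented, and `k = 60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["policy", "howto", "faq"]      # hypothetical vector-search order
sparse_hits = ["howto", "faq", "changelog"]  # hypothetical BM25 order
candidates = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(candidates)  # ['howto', 'faq', 'policy', 'changelog']
```

Documents that both retrievers agree on rise to the top, while anything found by only one retriever still survives as a candidate. That wide candidate list is what the reranker would then re-score and trim.
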

Context Engineering — the superset

RAG is one source of context. Context engineering is the broader discipline of curating the entire window — system instructions, history, retrieved passages, tool outputs, and output format — to give the model the right information, in the right format, at the right time.[5] For agents this is dynamic: new context floods in from every tool call, so it must be actively managed rather than written once.[6] The four working strategies:

Write: save context externally (scratchpads, memory)
Select: pull in only what's relevant now
Compress: summarize to shrink token load
Isolate: split work across sub-agents

The mantra: a context window is not a static string but a dynamic system that runs before every model call.[5]
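
As a rough sketch of the Select strategy, the function below rebuilds the window under a token budget and would be called before each model call. Everything here is assumed for illustration: the `ContextItem` type, the priority convention (lower is more important), and the word-count token estimate, which is not a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    source: str    # e.g. "system", "history", "retrieved", "tool"
    text: str
    priority: int  # lower number = more important (illustrative convention)


def estimate_tokens(text: str) -> int:
    """Crude token estimate: whitespace word count (an assumption, not a tokenizer)."""
    return len(text.split())


def assemble_window(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Keep the most important items that fit the budget, preserving original order."""
    chosen: set[int] = set()
    used = 0
    # Visit items from most to least important; sorted() is stable, so ties keep input order.
    for i in sorted(range(len(items)), key=lambda i: items[i].priority):
        cost = estimate_tokens(items[i].text)
        if used + cost <= budget:
            chosen.add(i)
            used += cost
    return [item for i, item in enumerate(items) if i in chosen]


items = [
    ContextItem("system", "You are a support assistant.", 0),
    ContextItem("history", "User greeted the bot and asked about the weather earlier today.", 3),
    ContextItem("retrieved", "Refunds are accepted within 30 days.", 1),
    ContextItem("tool", "order_status: shipped", 2),
]
window = assemble_window(items, budget=14)
print([item.source for item in window])  # ['system', 'retrieved', 'tool']
```

The low-priority history item is the one that gets dropped. In a fuller version it could be summarized (Compress) or written to external memory (Write) instead of discarded.
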

Context Rot & "Lost in the Middle"

Bigger context windows did not end the problem. Retrieval quality degrades as context length grows even for large-window models, and shorter, more precise context often beats dumping 50K tokens of retrieved text.[3] Models also miss information placed mid-sequence — the "lost in the middle" effect — so practitioners keep only the top 3–5 passages and place the strongest evidence at the very start and end.[4] One study found accuracy dropping around 32K tokens, far below advertised million-token limits.[5]

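[Figure 3 diagram: accuracy (y-axis) against context length (x-axis). Accuracy starts falling off at roughly 32K tokens; the "short & precise" regime sits before the drop and the "dump everything" regime after it.]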
FIG. 3 — Context rot. More tokens is not more performance; past a point, extra context dilutes attention and accuracy declines. Schematic curve based on the cited long-context findings.
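
The "keep the top few, put the strongest at the edges" practice can be sketched as a reordering of an already ranked list. The function name and the alternating placement scheme are assumptions made for this example; the source only says that the strongest evidence should sit at the start and end.

```python
def reorder_for_edges(ranked: list[str], keep: int = 5) -> list[str]:
    """Keep the top passages, then place them so the best sit at the start and end.

    `ranked` must be sorted best-first. Even-indexed items fill the front,
    odd-indexed items fill the back in reverse, so the weakest kept passage
    ends up in the middle.
    """
    top = ranked[:keep]
    front = top[0::2]
    back = top[1::2][::-1]
    return front + back


ranked = ["p1", "p2", "p3", "p4", "p5", "p6"]  # hypothetical IDs, best first
print(reorder_for_edges(ranked, keep=5))  # ['p1', 'p3', 'p5', 'p4', 'p2']
```

Here `p6` is dropped entirely, the two strongest passages (`p1`, `p2`) occupy the first and last slots, and the weakest survivor (`p5`) sits mid-sequence where it is most likely to be overlooked.
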

The Four Failure Modes

When context is mismanaged, it fails in four distinct ways — each needing a different fix.[5]

1. Poisoning

A wrong or hallucinated fact enters the context and gets treated as ground truth thereafter.

2. Distraction

So much context accumulates that the model loses focus on the actual task.

3. Confusion

Irrelevant-but-plausible content pulls the answer off course.

4. Clash

Retrieved passages contradict each other, and the model can't reconcile them.

03

How It All Fits Together

The pieces run as a two-clock pipeline. Index time happens offline, once: documents are chunked, embedded, and stored in a vector index. Query time happens live, per request: the query is rewritten and embedded, candidates are retrieved from that index with hybrid search and then reranked, the survivors are assembled into a tight context, and only then does the model generate an answer.

[Figure 4 diagram. Index time (offline, once): Documents → Chunk → Embed → Vector Index (the shared store). Query time (live, per request): Query → Rewrite + embed → Retrieve (hybrid) → Rerank → Assemble context → Generate. The retrieve step searches the index.]
FIG. 4 — The production pipeline. Both clocks meet at the vector index: index time fills it, query time searches it. The "assemble" step is where context engineering lives — deciding what actually reaches the model. Solid = data flow · dashed = lookup.
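
The two clocks can be illustrated with a toy store that separates the offline and live code paths. This is a deliberately simplified sketch: `ToyIndex` is a hypothetical class, bag-of-words counts stand in for real embeddings, and the rewrite, rerank, assemble, and generate stages are omitted.

```python
from collections import Counter


class ToyIndex:
    """Stand-in for a vector index; stores bag-of-words counts instead of embeddings."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, Counter]] = []

    @staticmethod
    def _embed(text: str) -> Counter:
        # Placeholder "embedding": lowercase word counts with basic punctuation stripped.
        return Counter(word.strip(".,?!").lower() for word in text.split())

    # Index time: offline, once per document.
    def add_document(self, text: str) -> None:
        for chunk in text.split("\n\n"):  # chunk on blank lines
            chunk = chunk.strip()
            if chunk:
                self.entries.append((chunk, self._embed(chunk)))  # embed and store

    # Query time: live, per request.
    def search(self, query: str, k: int = 2) -> list[str]:
        q = self._embed(query)
        # Score = number of overlapping words (multiset intersection).
        scored = [(sum((q & vec).values()), chunk) for chunk, vec in self.entries]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:k]]


index = ToyIndex()
index.add_document("Refunds are accepted within 30 days.\n\nShipping takes three to five days.")
print(index.search("When are refunds accepted?", k=1))
```

The point of the split is that `add_document` can be slow and run in batch, while `search` sits on the request path and only ever reads what index time has already stored.
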

The throughline: retrieval decides what the model could know, context engineering decides what it actually sees, and the generator can only be as good as that assembled window. RAG supplies the facts; context engineering rations the attention. And the whole pipeline only improves if it is measured — which is exactly where this discipline hands off to evaluation: the "RAG triad" of faithfulness, answer relevance, and context relevance is how you score each hop and gate changes to chunking, embeddings, or retrieval before they ship.

Bigger context windows are not the fix. The "just dump everything in" instinct fails because of context rot — accuracy can fall well before a model's advertised token limit, so curation beats capacity.

The model is usually not your bottleneck — retrieval is. Teams routinely spend weeks swapping models and tuning prompts while their retrieval quietly returns the wrong context every few queries. Measure the retrieval layer first.
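
Measuring the retrieval layer on its own can start with a small labeled set and two standard metrics, recall@k and mean reciprocal rank. Neither metric is named in the text above, so treat this as one possible starting point; the labeled queries below are invented.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical labeled queries: (retrieved IDs in rank order, set of truly relevant IDs).
eval_set = [
    (["a", "b", "c"], {"a"}),
    (["d", "e", "f"], {"f", "g"}),
    (["h", "i", "j"], {"z"}),
]
mean_recall = sum(recall_at_k(r, rel, k=3) for r, rel in eval_set) / len(eval_set)
mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_set) / len(eval_set)
print(round(mean_recall, 3), round(mrr, 3))  # 0.5 0.444
```

Tracking these numbers before and after a change to chunking, embeddings, or fusion shows whether retrieval actually improved, independently of which model generates the answer.
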

References

  1. IBM. What is RAG (Retrieval-Augmented Generation)? https://www.ibm.com/think/topics/retrieval-augmented-generation
  2. Pinecone. Retrieval-Augmented Generation. https://www.pinecone.io/learn/retrieval-augmented-generation/
  3. PremAI. Building Production RAG: Architecture, Chunking, Evaluation & Monitoring (2026 Guide). https://blog.premai.io/building-production-rag-architecture-chunking-evaluation-monitoring-2026-guide/
  4. Maxim AI. Solving the 'Lost in the Middle' Problem: Advanced RAG Techniques. https://www.getmaxim.ai/articles/solving-the-lost-in-the-middle-problem-advanced-rag-techniques-for-long-context-llms/
  5. Firecrawl. Context Engineering vs Prompt Engineering for AI Agents. https://www.firecrawl.dev/blog/context-engineering
  6. FlowHunt. Context Engineering for AI Agents. https://www.flowhunt.io/blog/context-engineering-for-ai-agents/
  7. DEV Community. RAG Is Not Dead: Advanced Retrieval Patterns That Actually Work in 2026. https://dev.to/young_gao/rag-is-not-dead-advanced-retrieval-patterns-that-actually-work-in-2026-2gbo
  8. Medium · J. Tan Ruan. Context Engineering in LLM-Based Agents. https://jtanruan.medium.com/context-engineering-in-llm-based-agents-d670d6b439bc