Context Window
Definition
The context window is the fundamental capacity limit of a language model. It caps the combined length of the input (all text provided to the model: system instructions, retrieved documents, conversation history, and the user message) plus the generated output. Context windows are measured in tokens, where one token is roughly 3/4 of a word on average. Modern LLMs range from 8K tokens (older models) to 1M tokens (Gemini 1.5 Pro). In RAG systems, the context window determines how many retrieved chunks can be included as grounding context: more chunks provide more information but consume more of the limited context window budget.
Why It Matters
Context window size is a key constraint in RAG system design. With limited context windows, engineers must make tradeoffs: include more retrieved chunks (potentially better coverage) or include longer conversation history (better multi-turn coherence). As models with larger context windows become available, it becomes tempting to retrieve many more documents rather than refining retrieval quality — but research shows that models do not always effectively use all content in very long contexts ('lost in the middle' problem). Optimal RAG systems balance context window usage by retrieving fewer, higher-quality chunks rather than maximizing chunk count.
How It Works
Context window budget management in RAG involves accounting for all token usage: system prompt (typically 200-500 tokens), retrieved context (typically 1,000-5,000 tokens), conversation history (0 to several thousand tokens depending on turn depth), and space for the generated response. Practitioners commonly set a budget of 4,000-8,000 tokens for retrieved context, then configure retrieval to return chunks that fit within this budget. For very long documents or high-context applications, context compression techniques (summarizing retrieved documents before including them) reduce token usage while preserving key information.
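The budget accounting above can be sketched as a short routine. The whitespace-based token heuristic and the specific token counts are illustrative assumptions; a production system would use the model's actual tokenizer.

```python
# Sketch of context-window budget accounting for a RAG prompt.
# The token heuristic below is an assumption for illustration only.

def count_tokens(text: str) -> int:
    # Rough heuristic: ~4/3 tokens per word (1 token is about 3/4 of a word).
    return max(1, round(len(text.split()) * 4 / 3))

def select_chunks(chunks: list[str], budget: int) -> list[str]:
    """Greedily keep the highest-ranked chunks that fit the token budget.
    `chunks` is assumed to be sorted by relevance, best first."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that would overflow the budget
        selected.append(chunk)
        used += cost
    return selected

# Example budget: an 8,192-token window minus system prompt, history,
# and reserved output space leaves the retrieved-context budget.
window = 8192
context_budget = window - 400 - 1000 - 500  # tokens left for chunks
```

A retrieval pipeline would call `select_chunks(ranked_chunks, context_budget)` after reranking, so the prompt assembly step never overshoots the window.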
Context Window Budget Allocation (128k tokens)
- System prompt: 5% (~6.4k tokens)
- Retrieved chunks: 60% (~76.8k tokens)
- Conversation history: 25% (~32k tokens)
- Output reserve: 10% (~12.8k tokens)
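The split above can be computed from percentage shares. The shares are the example values from the allocation figure, not fixed recommendations.

```python
# Budget split for a 128k-token window, using the example shares above.
WINDOW = 128_000
SHARES = {
    "system_prompt": 0.05,
    "retrieved_chunks": 0.60,
    "conversation_history": 0.25,
    "output_reserve": 0.10,
}
# round() avoids float-representation artifacts (e.g. 0.6 * 128000).
budget = {part: round(WINDOW * share) for part, share in SHARES.items()}
```

Keeping the shares in one place makes it easy to re-tune the split (say, shrinking history in favor of chunks) without touching prompt-assembly code.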
When chunks exceed budget — truncation order
- Oldest history: trimmed first
- Older chunks: lowest-relevance removed next
- System prompt: never truncated
Over-budget scenario
If retrieved chunks total 90k tokens (70%+), the conversation-history allocation shrinks. Implement a token counter before insertion, and drop or compress lower-ranked chunks to stay within budget.
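The truncation order described above can be sketched as a small routine. The token heuristic and the list shapes are illustrative assumptions, not a fixed API.

```python
# Sketch of the truncation order: trim oldest history first, then drop
# lowest-relevance chunks; the system prompt is never truncated.

def count_tokens(text: str) -> int:
    # Illustrative heuristic (~4/3 tokens per word); use a real tokenizer.
    return max(1, round(len(text.split()) * 4 / 3))

def fit_to_budget(system: str, history: list[str], chunks: list[str],
                  budget: int) -> tuple[list[str], list[str]]:
    """`history` is oldest-first; `chunks` are sorted best-first."""
    def total() -> int:
        return sum(count_tokens(p) for p in [system] + history + chunks)
    # 1) Trim the oldest conversation turns first.
    while total() > budget and history:
        history.pop(0)
    # 2) Then drop the lowest-relevance chunks (from the end of the ranking).
    while total() > budget and chunks:
        chunks.pop()
    # 3) The system prompt is never touched.
    return history, chunks
```

Running this guard just before prompt assembly guarantees the assembled prompt fits the window, at the cost of losing the oldest turns first.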
Real-World Example
A 99helpers customer runs their RAG system on a model with an 8,192-token context window. After accounting for the system prompt (400 tokens), conversation history (up to 1,000 tokens), and space for the response (500 tokens), they have approximately 6,300 tokens for retrieved context. At an average chunk size of 300 tokens, they can include up to 21 chunks. They configure retrieval to return top-7 chunks (2,100 tokens) rather than 21, avoiding the 'lost in the middle' problem where models ignore content in the middle of very long contexts. Answer accuracy is higher with 7 focused chunks than 21 diluted ones.
Common Mistakes
- ✕ Filling the context window with as many retrieved chunks as possible — more context does not always mean better answers; models often ignore content in the middle of long contexts
- ✕ Ignoring the context window budget when designing the system prompt — long, verbose system prompts eat into the token budget available for retrieved context
- ✕ Not planning for context window growth as conversation turns accumulate — multi-turn conversations must either summarize or truncate older turns to stay within the context limit
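The last point, truncating older turns, can be sketched as a sliding-window trimmer that keeps only the most recent turns fitting a history budget. The token heuristic is an illustrative assumption.

```python
# Sliding-window history trimmer: keep the newest turns within a budget.

def count_tokens(text: str) -> int:
    # Illustrative heuristic (~4/3 tokens per word); use a real tokenizer.
    return max(1, round(len(text.split()) * 4 / 3))

def trim_history(turns: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent turns that fit within max_tokens."""
    kept, used = [], 0
    for turn in reversed(turns):  # walk newest first
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

Summarizing dropped turns into a single synthetic turn is a common refinement when older context still matters.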
Related Terms
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language model responses by first retrieving relevant documents from an external knowledge base and then using that retrieved content as context when generating an answer.
Document Chunking
Document chunking is the process of splitting large documents into smaller text segments before embedding and indexing for RAG, balancing chunk size to preserve context while staying within embedding model limits and enabling precise retrieval.
Hallucination
Hallucination in AI refers to when a language model generates confident, plausible-sounding text that is factually incorrect, unsupported by the provided context, or entirely fabricated, posing a major reliability challenge for AI applications.
Grounding
Grounding in AI refers to anchoring a language model's responses to specific, verifiable source documents or data, reducing hallucination by ensuring the model draws on retrieved evidence rather than relying on potentially incorrect parametric knowledge.
RAG Evaluation
RAG evaluation is the systematic measurement of a RAG system's quality across multiple dimensions — including retrieval accuracy, answer faithfulness, relevance, and completeness — to identify weaknesses and guide improvement.