Context Length
Definition
Context length (also called context window size) is a fundamental architectural constraint of transformer-based LLMs: the model can only attend to tokens within its context window. Everything in a single LLM request—system prompt, few-shot examples, retrieved knowledge, conversation history, user message, and the generated response—must fit within this limit. Context lengths have grown dramatically: GPT-3 had 4K tokens; GPT-4o has 128K; Claude 3.5 has 200K; Gemini 1.5 Pro has 1M. Longer context windows enable entirely different application patterns: processing entire codebases, analyzing complete legal documents, maintaining long conversation histories, or performing multi-document synthesis—tasks impossible with smaller context windows.
Why It Matters
Context length is both a capability and a cost dimension. Longer context windows enable more ambitious applications: indexing entire codebases for AI code review, including complete policy documents for compliance chatbots, or maintaining month-long conversation histories. But longer inputs cost more (APIs price per token), take longer to process (prefill time scales with input length), and suffer from the 'lost in the middle' problem, where models attend less to content buried in the middle of very long contexts. For 99helpers products, context length planning means balancing how much knowledge base context to include per query, how much conversation history to retain, and how much of the budget the system prompt consumes.
How It Works
Context length management in production follows four steps:
1. Calculate the available context: max_context − system_prompt_tokens − reserved_response_tokens = budget available for history and retrieved content.
2. Prioritize: system prompt (fixed) > recent conversation (FIFO truncation of old turns) > retrieved context (top-K by relevance) > user message.
3. Count tokens before each API call using tiktoken or a provider-specific tokenizer.
4. Handle overflow gracefully, either with summarization (compress old history into a short summary) or with truncation plus a warning.
The 'lost in the middle' phenomenon, in which LLMs perform best on content at the start and end of the context, suggests placing the most critical content (the current query, the top retrieved chunks) at the context boundaries.
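The budgeting and prioritization steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: `count_tokens` is a crude 4-characters-per-token approximation standing in for a real tokenizer such as tiktoken, and all function and parameter names here are illustrative, not a real API.

```python
def count_tokens(text: str) -> int:
    # Rough approximation: ~4 characters per token. Production code
    # would use tiktoken or a provider-specific tokenizer instead.
    return max(1, len(text) // 4)

def build_context(system_prompt, history, retrieved_chunks, user_message,
                  max_context=128_000, reserved_response=4_000):
    # Step 1: fixed costs come off the top of the window.
    budget = max_context - count_tokens(system_prompt) - reserved_response
    budget -= count_tokens(user_message)

    # Step 2: retrieved chunks, assumed pre-sorted by relevance (top-K);
    # chunks that do not fit the remaining budget are skipped.
    kept_chunks = []
    for chunk in retrieved_chunks:
        cost = count_tokens(chunk)
        if cost <= budget:
            kept_chunks.append(chunk)
            budget -= cost

    # Step 3: conversation history, newest turns first, so FIFO truncation
    # drops the oldest turns when the budget runs out.
    kept_history = []
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept_history.append(turn)
        budget -= cost
    kept_history.reverse()

    # 'Lost in the middle': critical content (system prompt, top chunks,
    # current query) sits at the boundaries; older history in the middle.
    return [system_prompt, *kept_chunks, *kept_history, user_message]
```

In a real system the overflow branch would also trigger summarization or a warning rather than silently dropping turns.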
Context Length — Token Budget Breakdown

| Component | Budget (example configuration) |
|---|---|
| System prompt | 3,000 tokens |
| Conversation history | up to 5,000 tokens |
| Retrieved knowledge base context | up to 8,000 tokens |
| Total per request | ~16,000 tokens |
Context Windows by Model

| Model | Context window |
|---|---|
| GPT-3 | 4K tokens |
| GPT-4o | 128K tokens |
| Claude 3.5 | 200K tokens |
| Gemini 1.5 Pro | 1M tokens |
Real-World Example
A 99helpers customer's chatbot is configured with a 3,000-token system prompt, stores up to 5,000 tokens of conversation history, and retrieves up to 8,000 tokens of knowledge base context. Total: ~16,000 tokens per request. With GPT-4o's 128K context, this is well within limits. However, a power user session reaches 50 conversation turns (~25,000 tokens of history, at roughly 500 tokens per turn). The chatbot's history management compresses the oldest 40 turns into a 1,500-token summary: 'Earlier in this conversation, the user configured webhooks for Slack, had billing issues resolved, and requested API documentation.' With the 10 most recent turns (~5,000 tokens) kept verbatim, total context stays under 20,000 tokens while preserving the conversation's essential history.
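The history-compression step from this example can be sketched as follows. This is an illustrative sketch only: `llm_summarize` is a hypothetical stand-in for a real LLM summarization call, and `count_tokens` is a 4-characters-per-token approximation rather than a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Rough approximation: ~4 characters per token.
    return max(1, len(text) // 4)

def compress_history(turns, max_history_tokens=5_000, llm_summarize=None):
    """Keep the newest turns verbatim up to the token cap; fold all
    older turns into a single summary turn at the front."""
    total = 0
    recent = []
    # Walk backwards from the newest turn, keeping turns that fit the cap.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if total + cost > max_history_tokens:
            break
        recent.append(turn)
        total += cost
    recent.reverse()

    older = turns[: len(turns) - len(recent)]
    if not older:
        return recent  # everything fits; nothing to compress

    # llm_summarize would normally be an LLM call that produces a short
    # summary of the old turns; the fallback here is a crude placeholder.
    summarize = llm_summarize or (
        lambda ts: "Earlier in this conversation: "
        + "; ".join(t[:40] for t in ts))
    return [summarize(older)] + recent
```

The summary turn then occupies a small, fixed slice of the history budget regardless of how long the session runs.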
Common Mistakes
- ✕ Treating context length as the primary model selection criterion: a 200K context model is not necessarily better for your use case than a 128K model; other factors (quality, cost, latency) often matter more.
- ✕ Filling the entire context window unnecessarily: more context often means more noise; including only the most relevant retrieved chunks is better than including everything possible.
- ✕ Ignoring that context length affects inference cost and latency: API providers charge per token, so longer contexts cost more and take longer to process in the prefill phase.
Related Terms
Token
A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.
KV Cache
The KV cache stores the key and value attention tensors computed during the prefill phase, allowing subsequent token generation to reuse these computations rather than recomputing them for every new token.
Token Budget
A token budget is the maximum number of tokens allocated to different sections of an LLM prompt in a RAG system—system instructions, retrieved context, and conversation history—ensuring the total stays within the model's context window limit.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Long-Context RAG
Long-context RAG leverages LLMs with large context windows (100K+ tokens) to process many or entire documents at once, reducing reliance on retrieval precision but increasing cost and latency compared to traditional top-K retrieval.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →