Context Length
Definition
Context length (also called context window size) is a fundamental architectural constraint of transformer-based LLMs: the model can only attend to tokens within its context window. Everything in a single LLM request—system prompt, few-shot examples, retrieved knowledge, conversation history, user message, and the generated response—must fit within this limit. Context lengths have grown dramatically: GPT-3 had 4K tokens; GPT-4o has 128K; Claude 3.5 has 200K; Gemini 1.5 Pro has 1M. Longer context windows enable entirely different application patterns: processing entire codebases, analyzing complete legal documents, maintaining long conversation histories, or performing multi-document synthesis—tasks impossible with smaller context windows.
Why It Matters
Context length is both a capability and a cost dimension. Longer context windows enable more ambitious applications: indexing entire codebases for AI code review, including complete policy documents for compliance chatbots, or maintaining month-long conversation histories. But longer inputs cost more (APIs price per token), take longer to process (prefill time scales with input length), and suffer from the 'lost in the middle' problem, where models attend less to content buried in the middle of very long contexts. For 99helpers products, context length planning means balancing how much knowledge base context to include per query, how much conversation history to retain, and how much of the budget the system prompt consumes.
How It Works
Context length management in production follows four steps:
1. Calculate the available context: max_context − system_prompt_tokens − reserved_response_tokens = budget available for history and retrieved content.
2. Prioritize: system prompt (fixed) > recent conversation (FIFO truncation of old turns) > retrieved context (top-K by relevance) > user message.
3. Count tokens before each API call using tiktoken or a provider-specific tokenizer.
4. Handle overflow gracefully, either with summarization (compress old history into a short summary) or with truncation plus a warning.
The 'lost in the middle' phenomenon, in which LLMs perform best on content at the start and end of the context, suggests placing the most critical content (the current query, the top retrieved chunks) at the context boundaries.
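The budgeting and prioritization steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: `count_tokens` is a crude 4-characters-per-token approximation standing in for a real tokenizer such as tiktoken, and all function and parameter names here are illustrative, not a real API.

```python
def count_tokens(text: str) -> int:
    # Rough approximation: ~4 characters per token. Production code
    # would use tiktoken or a provider-specific tokenizer instead.
    return max(1, len(text) // 4)

def build_context(system_prompt, history, retrieved_chunks, user_message,
                  max_context=128_000, reserved_response=4_000):
    # Step 1: fixed costs come off the top of the window.
    budget = max_context - count_tokens(system_prompt) - reserved_response
    budget -= count_tokens(user_message)

    # Step 2: retrieved chunks, assumed pre-sorted by relevance (top-K);
    # chunks that do not fit the remaining budget are skipped.
    kept_chunks = []
    for chunk in retrieved_chunks:
        cost = count_tokens(chunk)
        if cost <= budget:
            kept_chunks.append(chunk)
            budget -= cost

    # Step 3: conversation history, newest turns first, so FIFO truncation
    # drops the oldest turns when the budget runs out.
    kept_history = []
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept_history.append(turn)
        budget -= cost
    kept_history.reverse()

    # 'Lost in the middle': critical content (system prompt, top chunks,
    # current query) sits at the boundaries; older history in the middle.
    return [system_prompt, *kept_chunks, *kept_history, user_message]
```

In a real system the overflow branch would also trigger summarization or a warning rather than silently dropping turns.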
Context Length — Token Budget Breakdown

| Component | Budget (example configuration) |
|---|---|
| System prompt | 3,000 tokens |
| Conversation history | up to 5,000 tokens |
| Retrieved knowledge base context | up to 8,000 tokens |
| Total per request | ~16,000 tokens |
Context Windows by Model

| Model | Context window |
|---|---|
| GPT-3 | 4K tokens |
| GPT-4o | 128K tokens |
| Claude 3.5 | 200K tokens |
| Gemini 1.5 Pro | 1M tokens |
Real-World Example
A 99helpers customer's chatbot is configured with a 3,000-token system prompt, stores up to 5,000 tokens of conversation history, and retrieves up to 8,000 tokens of knowledge base context. Total: ~16,000 tokens per request. With GPT-4o's 128K context, this is well within limits. However, a power user session reaches 50 conversation turns (~25,000 tokens of history, at roughly 500 tokens per turn). The chatbot's history management compresses the oldest 40 turns into a 1,500-token summary: 'Earlier in this conversation, the user configured webhooks for Slack, had billing issues resolved, and requested API documentation.' With the 10 most recent turns (~5,000 tokens) kept verbatim, total context stays under 20,000 tokens while preserving the conversation's essential history.
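The history-compression step from this example can be sketched as follows. This is an illustrative sketch only: `llm_summarize` is a hypothetical stand-in for a real LLM summarization call, and `count_tokens` is a 4-characters-per-token approximation rather than a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Rough approximation: ~4 characters per token.
    return max(1, len(text) // 4)

def compress_history(turns, max_history_tokens=5_000, llm_summarize=None):
    """Keep the newest turns verbatim up to the token cap; fold all
    older turns into a single summary turn at the front."""
    total = 0
    recent = []
    # Walk backwards from the newest turn, keeping turns that fit the cap.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if total + cost > max_history_tokens:
            break
        recent.append(turn)
        total += cost
    recent.reverse()

    older = turns[: len(turns) - len(recent)]
    if not older:
        return recent  # everything fits; nothing to compress

    # llm_summarize would normally be an LLM call that produces a short
    # summary of the old turns; the fallback here is a crude placeholder.
    summarize = llm_summarize or (
        lambda ts: "Earlier in this conversation: "
        + "; ".join(t[:40] for t in ts))
    return [summarize(older)] + recent
```

The summary turn then occupies a small, fixed slice of the history budget regardless of how long the session runs.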
Common Mistakes
- ✕ Treating context length as the primary model selection criterion: a 200K context model is not necessarily better for your use case than a 128K model; other factors (quality, cost, latency) often matter more.
- ✕ Filling the entire context window unnecessarily: more context often means more noise; including only the most relevant retrieved chunks is better than including everything possible.
- ✕ Ignoring that context length affects inference cost and latency: API providers charge per token, so longer contexts cost more and take longer to process in the prefill phase.
Related Terms
Token
A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.
KV Cache
The KV cache stores the key and value attention tensors computed during the prefill phase, allowing subsequent token generation to reuse these computations rather than recomputing them for every new token.
Token Budget
A token budget is the maximum number of tokens allocated to different sections of an LLM prompt in a RAG system—system instructions, retrieved context, and conversation history—ensuring the total stays within the model's context window limit.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Long-Context RAG
Long-context RAG leverages LLMs with large context windows (100K+ tokens) to process many or entire documents at once, reducing reliance on retrieval precision but increasing cost and latency compared to traditional top-K retrieval.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →