Token Budget
Definition
Every LLM has a context window limit (e.g., GPT-4o: 128K tokens, Claude 3.5 Sonnet: 200K tokens). In a RAG prompt, this budget must be divided among several competing components: the system prompt (instructions, persona, constraints), the retrieved context (chunks from the knowledge base), conversation history (prior messages in the session), and the user's current query. A token budget defines allocation rules for each component to ensure the total never exceeds the model's limit. Exceeding the limit causes the oldest or least-prioritized content to be truncated, potentially removing critical instructions or context. Budget management is especially important for long conversations where history accumulates over many turns.
Why It Matters
Token budget management is a production necessity, not an optimization detail. A 99helpers chatbot session that accumulates 20 conversation turns, each with a substantial response, can easily exceed 30,000 tokens of history before the current query. Without a token budget, the system might try to include all history plus 5 retrieved chunks plus a system prompt, exceeding the model's limit and either crashing with an error or silently truncating the context in unpredictable ways. Explicit token budgets ensure the most important content—current query, most relevant chunks, core instructions—always fits, with lower-priority content (old conversation history) gracefully dropped when the budget is exceeded.
How It Works
Implementing a token budget starts with fixed maximum allocations (e.g., system prompt: 1,000 tokens; retrieved context: 4,000; conversation history: 2,000; query: 500; safety buffer: 500 = 8,000 total for a model with an 8,192-token limit). Count tokens for each component before prompt assembly, using a tokenizer that matches the model (tiktoken for OpenAI models). If the retrieved context exceeds its budget, trim to the top-N chunks by relevance score. If history exceeds its budget, drop the oldest turns first (FIFO) or summarize old turns into a condensed version. Libraries such as LangChain's ConversationTokenBufferMemory manage the history budget automatically.
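The trimming logic described above can be sketched in a few lines. This is a minimal illustration with made-up budgets and content; `count_tokens` is a whitespace stand-in, and a real system should use a tokenizer matched to the model (e.g., tiktoken for OpenAI models):

```python
# Sketch of token-budget enforcement for a RAG prompt. Budgets and inputs are
# illustrative; count_tokens() is a whitespace stand-in for a real tokenizer.

BUDGET = {"system": 1000, "context": 4000, "history": 2000, "query": 500}
# The remaining 500 tokens of an 8,192-token window are left as a safety buffer.

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in; swap for tiktoken in production

def fit_context(chunks, budget):
    """Keep top chunks (assumed sorted by relevance) until the budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return kept

def fit_history(turns, budget):
    """Drop oldest turns first (FIFO) until the remaining history fits."""
    while turns and sum(count_tokens(t) for t in turns) > budget:
        turns = turns[1:]  # drop the oldest turn
    return turns

chunks = ["chunk one " * 800, "chunk two " * 800, "chunk three " * 800]  # ~1,600 tokens each
history = [f"turn {i} " * 150 for i in range(10)]  # ~300 tokens per turn

kept_chunks = fit_context(chunks, BUDGET["context"])
kept_history = fit_history(history, BUDGET["history"])
print(len(kept_chunks), len(kept_history))  # → 2 6
```

The third chunk would push context past 4,000 tokens, so it is dropped; the four oldest history turns are evicted for the same reason.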
Token Budget — Context Window Allocation (8,192 tokens)
(Figure panels: budget breakdown, overflow handling, cost estimate.)

| Cost estimate | Calculation | Result |
|---|---|---|
| Input cost | 8,192 × $0.003 / 1M tokens | $0.000025 / call |
| Daily (1,000 calls / day) | × 1,000 | $0.025 / day |
| Monthly | × 30 days | $0.75 / mo |
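The cost figures in the table can be reproduced with a few lines of arithmetic. The $0.003-per-1M-input-tokens rate is the table's assumption, not a quoted price, and the monthly figure follows the table in rounding the daily cost first:

```python
# Verify the cost-estimate arithmetic from the table above.
tokens_per_call = 8192
price_per_token = 0.003 / 1_000_000  # assumed rate: $0.003 per 1M input tokens

cost_per_call = tokens_per_call * price_per_token
cost_per_day = cost_per_call * 1_000              # 1,000 calls per day
cost_per_month = round(cost_per_day, 3) * 30      # table rounds daily cost first

print(f"${cost_per_call:.6f} / call")  # → $0.000025 / call
print(f"${cost_per_day:.3f} / day")    # → $0.025 / day
print(f"${cost_per_month:.2f} / mo")   # → $0.75 / mo
```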
Real-World Example
A 99helpers chatbot using GPT-4o (128K context limit) allocates: 2,000 tokens for system prompt, 8,000 for retrieved context (top 5 chunks ~1,600 tokens each), 16,000 for conversation history, 2,000 for the current query and response buffer. Most sessions stay well within the 28,000-token total. For power users with very long sessions (100+ turns), conversation history is compressed: the ConversationSummaryMemory periodically summarizes the oldest 20 turns into a 500-token summary, keeping the history budget below its limit while preserving important context established early in the session.
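The compression step described above can be sketched generically. This is a minimal illustration, not LangChain's implementation: `summarize` is a hypothetical stand-in for an LLM summarization call (which ConversationSummaryMemory performs internally), and `count_tokens` is a whitespace stand-in for a real tokenizer:

```python
# Sketch of history compression: when accumulated turns exceed the history
# budget, fold the oldest `batch` turns into one short summary entry.

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in; use a model-matched tokenizer in production

def compress_history(turns, budget, summarize, batch=20):
    """Replace the oldest `batch` turns with a summary until history fits."""
    while sum(count_tokens(t) for t in turns) > budget and len(turns) > batch:
        summary = summarize(turns[:batch])  # condense the oldest turns
        turns = [summary] + turns[batch:]   # the summary replaces them
    return turns

# Dummy summarizer for demonstration: a fixed-length placeholder for an LLM call.
dummy_summarize = lambda turns: "summary " * 50  # ~50-token summary

history = [f"turn {i} detail " * 40 for i in range(100)]  # ~120 tokens per turn
compressed = compress_history(history, budget=2000, summarize=dummy_summarize)
```

Each pass shrinks 20 entries to one 50-token summary, so a 100-turn, ~12,000-token history collapses to a handful of entries under the 2,000-token budget while the early context survives in summarized form.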
Common Mistakes
- ✕Setting context budgets without accounting for the system prompt and conversation history, then being surprised when long conversations cause context overflow.
- ✕Truncating by character count instead of token count—token density varies across text, so character-based truncation can over- or under-count by 30% or more.
- ✕Not testing token budget handling in long-session scenarios—the edge cases only appear after many conversation turns and are easy to miss in unit tests.
Related Terms
Context Window
A context window is the maximum amount of text (measured in tokens) that a language model can process in a single inference call, determining how much retrieved content, conversation history, and instructions can be included in a RAG prompt.
Generation Pipeline
A generation pipeline is the LLM-side workflow in RAG that assembles retrieved context into a prompt, calls the language model, and post-processes the output into a final user-facing answer.
RAG Pipeline
A RAG pipeline is the end-to-end sequence of components—ingestion, chunking, embedding, storage, retrieval, and generation—that transforms raw documents into AI-generated answers grounded in a knowledge base.
Retrieval Pipeline
A retrieval pipeline is the online query-time workflow that transforms a user question into a ranked set of relevant document chunks, serving as the information retrieval stage of a RAG system.
Long-Context RAG
Long-context RAG leverages LLMs with large context windows (100K+ tokens) to process many or entire documents at once, reducing reliance on retrieval precision but increasing cost and latency compared to traditional top-K retrieval.