Token Budget
Definition
Every LLM has a context window limit (e.g., GPT-4o: 128K tokens, Claude 3.5 Sonnet: 200K tokens). In a RAG prompt, this budget must be divided among several competing components: the system prompt (instructions, persona, constraints), the retrieved context (chunks from the knowledge base), conversation history (prior messages in the session), and the user's current query. A token budget defines allocation rules for each component to ensure the total never exceeds the model's limit. Exceeding the limit causes the oldest or least-prioritized content to be truncated, potentially removing critical instructions or context. Budget management is especially important for long conversations where history accumulates over many turns.
Why It Matters
Token budget management is a production necessity, not an optimization detail. A 99helpers chatbot session that accumulates 20 conversation turns, each with a substantial response, can easily exceed 30,000 tokens of history before the current query. Without a token budget, the system might try to include all history plus 5 retrieved chunks plus a system prompt, exceeding the model's limit and either crashing with an error or silently truncating the context in unpredictable ways. Explicit token budgets ensure the most important content—current query, most relevant chunks, core instructions—always fits, with lower-priority content (old conversation history) gracefully dropped when the budget is exceeded.
How It Works
Implementing a token budget starts with fixed maximum allocations (e.g., system prompt: 1,000 tokens; retrieved context: 4,000; conversation history: 2,000; query: 500; safety buffer: 500 = 8,000 total for a model with an 8,192-token limit). Count tokens for each component before prompt assembly, using a tokenizer that matches the model (tiktoken for OpenAI models). If the retrieved context exceeds its budget, trim to the top-N chunks by relevance score. If history exceeds its budget, drop the oldest turns first (FIFO) or summarize old turns into a condensed version. Libraries such as LangChain's ConversationTokenBufferMemory manage the history budget automatically.
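The trimming logic described above can be sketched in a few lines. This is a minimal illustration with made-up budgets and content; `count_tokens` is a whitespace stand-in, and a real system should use a tokenizer matched to the model (e.g., tiktoken for OpenAI models):

```python
# Sketch of token-budget enforcement for a RAG prompt. Budgets and inputs are
# illustrative; count_tokens() is a whitespace stand-in for a real tokenizer.

BUDGET = {"system": 1000, "context": 4000, "history": 2000, "query": 500}
# The remaining 500 tokens of an 8,192-token window are left as a safety buffer.

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in; swap for tiktoken in production

def fit_context(chunks, budget):
    """Keep top chunks (assumed sorted by relevance) until the budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return kept

def fit_history(turns, budget):
    """Drop oldest turns first (FIFO) until the remaining history fits."""
    while turns and sum(count_tokens(t) for t in turns) > budget:
        turns = turns[1:]  # drop the oldest turn
    return turns

chunks = ["chunk one " * 800, "chunk two " * 800, "chunk three " * 800]  # ~1,600 tokens each
history = [f"turn {i} " * 150 for i in range(10)]  # ~300 tokens per turn

kept_chunks = fit_context(chunks, BUDGET["context"])
kept_history = fit_history(history, BUDGET["history"])
print(len(kept_chunks), len(kept_history))  # → 2 6
```

The third chunk would push context past 4,000 tokens, so it is dropped; the four oldest history turns are evicted for the same reason.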
Token Budget — Context Window Allocation (8,192 tokens)
(Figure panels: budget breakdown, overflow handling, cost estimate.)

| Cost estimate | Calculation | Result |
|---|---|---|
| Input cost | 8,192 × $0.003 / 1M tokens | $0.000025 / call |
| Daily (1,000 calls / day) | × 1,000 | $0.025 / day |
| Monthly | × 30 days | $0.75 / mo |
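The cost figures in the table can be reproduced with a few lines of arithmetic. The $0.003-per-1M-input-tokens rate is the table's assumption, not a quoted price, and the monthly figure follows the table in rounding the daily cost first:

```python
# Verify the cost-estimate arithmetic from the table above.
tokens_per_call = 8192
price_per_token = 0.003 / 1_000_000  # assumed rate: $0.003 per 1M input tokens

cost_per_call = tokens_per_call * price_per_token
cost_per_day = cost_per_call * 1_000              # 1,000 calls per day
cost_per_month = round(cost_per_day, 3) * 30      # table rounds daily cost first

print(f"${cost_per_call:.6f} / call")  # → $0.000025 / call
print(f"${cost_per_day:.3f} / day")    # → $0.025 / day
print(f"${cost_per_month:.2f} / mo")   # → $0.75 / mo
```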
Real-World Example
A 99helpers chatbot using GPT-4o (128K context limit) allocates: 2,000 tokens for system prompt, 8,000 for retrieved context (top 5 chunks ~1,600 tokens each), 16,000 for conversation history, 2,000 for the current query and response buffer. Most sessions stay well within the 28,000-token total. For power users with very long sessions (100+ turns), conversation history is compressed: the ConversationSummaryMemory periodically summarizes the oldest 20 turns into a 500-token summary, keeping the history budget below its limit while preserving important context established early in the session.
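The compression step described above can be sketched generically. This is a minimal illustration, not LangChain's implementation: `summarize` is a hypothetical stand-in for an LLM summarization call (which ConversationSummaryMemory performs internally), and `count_tokens` is a whitespace stand-in for a real tokenizer:

```python
# Sketch of history compression: when accumulated turns exceed the history
# budget, fold the oldest `batch` turns into one short summary entry.

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in; use a model-matched tokenizer in production

def compress_history(turns, budget, summarize, batch=20):
    """Replace the oldest `batch` turns with a summary until history fits."""
    while sum(count_tokens(t) for t in turns) > budget and len(turns) > batch:
        summary = summarize(turns[:batch])  # condense the oldest turns
        turns = [summary] + turns[batch:]   # the summary replaces them
    return turns

# Dummy summarizer for demonstration: a fixed-length placeholder for an LLM call.
dummy_summarize = lambda turns: "summary " * 50  # ~50-token summary

history = [f"turn {i} detail " * 40 for i in range(100)]  # ~120 tokens per turn
compressed = compress_history(history, budget=2000, summarize=dummy_summarize)
```

Each pass shrinks 20 entries to one 50-token summary, so a 100-turn, ~12,000-token history collapses to a handful of entries under the 2,000-token budget while the early context survives in summarized form.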
Common Mistakes
- ✕Setting context budgets without accounting for the system prompt and conversation history, then being surprised when long conversations cause context overflow.
- ✕Truncating by character count instead of token count—token density varies across text, so character-based truncation can over- or under-count by 30% or more.
- ✕Not testing token budget handling in long-session scenarios—the edge cases only appear after many conversation turns and are easy to miss in unit tests.
Related Terms
Context Window
A context window is the maximum amount of text (measured in tokens) that a language model can process in a single inference call, determining how much retrieved content, conversation history, and instructions can be included in a RAG prompt.
Generation Pipeline
A generation pipeline is the LLM-side workflow in RAG that assembles retrieved context into a prompt, calls the language model, and post-processes the output into a final user-facing answer.
RAG Pipeline
A RAG pipeline is the end-to-end sequence of components—ingestion, chunking, embedding, storage, retrieval, and generation—that transforms raw documents into AI-generated answers grounded in a knowledge base.
Retrieval Pipeline
A retrieval pipeline is the online query-time workflow that transforms a user question into a ranked set of relevant document chunks, serving as the information retrieval stage of a RAG system.
Long-Context RAG
Long-context RAG leverages LLMs with large context windows (100K+ tokens) to process many or entire documents at once, reducing reliance on retrieval precision but increasing cost and latency compared to traditional top-K retrieval.