Prompt Compression
Definition
Prompt compression encompasses techniques that reduce the token volume of LLM inputs while preserving task-relevant information. Methods include: (1) selective summarization—compressing verbose retrieved documents into concise summaries; (2) extractive compression—retaining only the most relevant sentences from retrieved chunks; (3) LLMLingua-style methods, which train a small model to remove tokens unlikely to affect the output; (4) conversation history compression—summarizing older turns rather than keeping their raw text; (5) prompt pruning—removing redundant instructions and examples. Compression reduces both cost (fewer input tokens) and latency, and it can also improve quality by removing irrelevant context that dilutes model attention.
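As a concrete illustration of technique (2), the sketch below keeps only the sentences that share the most words with the query. A production system would use a learned relevance scorer rather than word overlap, and the function name is illustrative, but the shape of the transformation—score, rank, drop the rest—is the same.

```python
import re

def _words(s: str) -> set:
    return set(re.findall(r"[a-z0-9']+", s.lower()))

def extractive_compress(text: str, query: str, keep: int = 2) -> str:
    """Keep the `keep` sentences that overlap most with the query.

    Word overlap is a crude stand-in for a learned relevance scorer,
    but it shows the extractive-compression pipeline end to end.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    qw = _words(query)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: len(qw & _words(sentences[i])),
                    reverse=True)
    kept = sorted(ranked[:keep])  # restore original document order
    return " ".join(sentences[i] for i in kept)
```

Because the kept sentences are re-sorted into document order, the compressed chunk still reads coherently instead of as a bag of top-ranked fragments.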
Why It Matters
As LLM applications scale, prompt token costs become significant. A RAG system that retrieves 10 chunks averaging 500 tokens each spends 5,000 tokens on context per query—at $15/million input tokens, that's $0.075 per query, or $75,000/day at 1 million daily queries (roughly $2.25 million/month). Prompt compression that reduces context by 50% halves that cost. Beyond economics, compression often improves quality by focusing the model's attention on the most relevant content and reducing the 'lost in the middle' effect, where information buried in long contexts is underweighted. Compression is a high-ROI optimization for production RAG systems.
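The arithmetic is easy to verify (figures taken from the example above; a 30-day month puts the uncompressed bill around $2.25 million):

```python
def context_cost(chunks: int, tokens_per_chunk: int,
                 usd_per_million_tokens: float) -> float:
    """Dollar cost of the retrieved context for a single query."""
    return chunks * tokens_per_chunk / 1_000_000 * usd_per_million_tokens

per_query = context_cost(10, 500, 15.0)  # 5,000 tokens -> $0.075
per_day = per_query * 1_000_000          # $75,000 at 1M queries/day
halved = per_day * 0.5                   # 50% compression -> $37,500/day
```

Swapping in your own chunk counts and provider pricing gives a quick estimate of what a given compression ratio is worth per day.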
How It Works
LLMLingua-style compression uses a small, efficient language model to score each token's importance for the downstream task. Tokens below an importance threshold are removed, producing a compressed prompt that is 50-80% shorter while preserving roughly 95% of the LLM's response quality. Selective retrieval compresses by re-ranking retrieved chunks and discarding the lowest-ranked ones before constructing the prompt. Conversation compression replaces older turns with summaries that preserve the key established facts and decisions without retaining every word. The best compression strategy depends on which portion of the prompt is largest.
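A minimal sketch of the score-and-prune step, with a stopword heuristic standing in for the small model's per-token importance scores (real LLMLingua-style systems derive scores from a trained language model's log-probabilities; the names and word list here are illustrative):

```python
# Stand-in importance scorer: filler words score 0, everything else 1.
# A real system would score tokens with a small trained LM instead.
LOW_VALUE = {"i", "you", "a", "an", "the", "to", "of", "and", "that",
             "is", "would", "like", "please", "kindly", "really", "very"}

def compress_by_importance(prompt: str, keep_ratio: float = 0.5) -> str:
    tokens = prompt.split()
    budget = max(1, int(len(tokens) * keep_ratio))
    # Score each token, keep the highest-scoring ones within the budget,
    # then restore original order so the prompt still reads left to right.
    scored = [(0.0 if t.lower().strip(".,!?") in LOW_VALUE else 1.0, i)
              for i, t in enumerate(tokens)]
    kept = sorted(i for _, i in
                  sorted(scored, key=lambda s: (-s[0], s[1]))[:budget])
    return " ".join(tokens[i] for i in kept)
```

The `keep_ratio` parameter plays the role of the compression target: 0.5 keeps half the tokens, 0.2 keeps a fifth.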
Prompt Compression — Reduce Tokens, Preserve Meaning
Verbose Prompt
620 tokens
"I would like you to please carefully read the following customer support transcript that has been provided below and then, after you have thoroughly read and understood all of the content in the transcript, I would appreciate it if you could kindly generate a concise summary that highlights the most important points..."
Compressed Prompt
38 tokens
"Summarize the key points from the support transcript below."
Token reduction: 620 → 38 (94% smaller)
Aggressive compression can remove semantic nuance. Always benchmark accuracy before and after compression, especially for complex reasoning tasks.
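One way to operationalize that advice is a simple regression gate over a fixed eval set: run the same questions with and without compression, score each answer, and reject any compression configuration whose quality drop exceeds a budget. The scoring scale and 2-point threshold below are illustrative.

```python
def compression_acceptable(baseline_scores, compressed_scores,
                           max_drop: float = 0.02) -> bool:
    """Return True if mean eval quality drops by at most `max_drop`
    (e.g. 2 points on a 0..1 scale) under the compressed prompts."""
    base = sum(baseline_scores) / len(baseline_scores)
    comp = sum(compressed_scores) / len(compressed_scores)
    return base - comp <= max_drop
```

Running this gate in CI whenever the compression ratio or model changes catches quality regressions before they reach users.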
Real-World Example
A legal document analysis service retrieves an average of 15,000 tokens of relevant case law per query against their full context window of 128K tokens. After implementing LLMLingua compression with a 60% compression ratio, retrieved context averages 6,000 tokens—reducing per-query costs by 60% and cutting average query latency from 12 seconds to 4 seconds. Response quality on their evaluation set declined by only 1.8%, well within acceptable bounds. The cost savings funded development of two new features in the same quarter.
Common Mistakes
- ✕ Compressing context indiscriminately without evaluating quality impact—some compression introduces errors by removing key supporting details
- ✕ Compressing system prompts aggressively—system prompt instructions are typically dense and important; over-compressing them degrades reliability
- ✕ Applying the same compression ratio to all document types—technical and legal text require more conservative compression than general prose
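The last mistake is often fixed with a per-document-type keep ratio. The mapping below is a hypothetical configuration (ratios are the fraction of tokens retained, so denser text keeps more):

```python
# Fraction of tokens to keep per document type; values are illustrative.
KEEP_RATIOS = {
    "legal": 0.85,        # dense, citation-heavy: compress gently
    "technical": 0.75,
    "support_chat": 0.40,  # conversational filler compresses well
    "general": 0.50,
}

def keep_ratio(doc_type: str) -> float:
    # Default conservatively when the document type is unknown.
    return KEEP_RATIOS.get(doc_type, 0.75)
```

Each ratio should be set empirically by running the document type through the quality evaluation described above, not chosen by intuition.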
Related Terms
Context Window
A context window is the maximum amount of text (measured in tokens) that a language model can process in a single inference call, determining how much retrieved content, conversation history, and instructions can be included in a RAG prompt.
Prompt Engineering
Prompt engineering is the practice of designing and refining the text inputs given to AI language models to reliably produce accurate, useful, and well-formatted outputs for specific tasks.
Retrieval-Augmented Prompting
Retrieval-augmented prompting dynamically injects relevant documents or facts into the prompt at query time, grounding the LLM's response in current, specific knowledge rather than relying solely on its static pre-trained memory.
Token
A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.
Prompt Chaining
Prompt chaining connects multiple LLM calls sequentially where each step's output becomes the next step's input, enabling complex multi-stage tasks that exceed what any single prompt can accomplish reliably.