Prompt Compression
Definition
Prompt compression encompasses techniques that reduce the token volume of LLM inputs while preserving task-relevant information. Methods include: (1) selective summarization—compressing verbose retrieved documents into concise summaries; (2) extractive compression—retaining only the most relevant sentences from retrieved chunks; (3) LLMLingua-style methods, which train a small model to remove tokens unlikely to affect the output; (4) conversation history compression—summarizing older turns rather than keeping their raw text; (5) prompt pruning—removing redundant instructions and examples. Compression reduces both cost (fewer input tokens) and latency, and it can also improve quality by removing irrelevant context that dilutes model attention.
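As a concrete illustration of technique (2), the sketch below keeps only the sentences that share the most words with the query. A production system would use a learned relevance scorer rather than word overlap, and the function name is illustrative, but the shape of the transformation—score, rank, drop the rest—is the same.

```python
import re

def _words(s: str) -> set:
    return set(re.findall(r"[a-z0-9']+", s.lower()))

def extractive_compress(text: str, query: str, keep: int = 2) -> str:
    """Keep the `keep` sentences that overlap most with the query.

    Word overlap is a crude stand-in for a learned relevance scorer,
    but it shows the extractive-compression pipeline end to end.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    qw = _words(query)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: len(qw & _words(sentences[i])),
                    reverse=True)
    kept = sorted(ranked[:keep])  # restore original document order
    return " ".join(sentences[i] for i in kept)
```

Because the kept sentences are re-sorted into document order, the compressed chunk still reads coherently instead of as a bag of top-ranked fragments.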
Why It Matters
As LLM applications scale, prompt token costs become significant. A RAG system that retrieves 10 chunks averaging 500 tokens each spends 5,000 tokens on context per query—at $15/million input tokens, that's $0.075 per query, or $75,000/day at 1 million daily queries (roughly $2.25 million/month). Prompt compression that reduces context by 50% halves that cost. Beyond economics, compression often improves quality by focusing the model's attention on the most relevant content and reducing the 'lost in the middle' effect, where information buried in long contexts is underweighted. Compression is a high-ROI optimization for production RAG systems.
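The arithmetic is easy to verify (figures taken from the example above; a 30-day month puts the uncompressed bill around $2.25 million):

```python
def context_cost(chunks: int, tokens_per_chunk: int,
                 usd_per_million_tokens: float) -> float:
    """Dollar cost of the retrieved context for a single query."""
    return chunks * tokens_per_chunk / 1_000_000 * usd_per_million_tokens

per_query = context_cost(10, 500, 15.0)  # 5,000 tokens -> $0.075
per_day = per_query * 1_000_000          # $75,000 at 1M queries/day
halved = per_day * 0.5                   # 50% compression -> $37,500/day
```

Swapping in your own chunk counts and provider pricing gives a quick estimate of what a given compression ratio is worth per day.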
How It Works
LLMLingua-style compression uses a small, efficient language model to score each token's importance for the downstream task. Tokens below an importance threshold are removed, producing a compressed prompt that is 50-80% shorter while preserving roughly 95% of the LLM's response quality. Selective retrieval compresses by re-ranking retrieved chunks and discarding the lowest-ranked ones before constructing the prompt. Conversation compression replaces older turns with summaries that preserve the key established facts and decisions without retaining every word. The best compression strategy depends on which portion of the prompt is largest.
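A minimal sketch of the score-and-prune step, with a stopword heuristic standing in for the small model's per-token importance scores (real LLMLingua-style systems derive scores from a trained language model's log-probabilities; the names and word list here are illustrative):

```python
# Stand-in importance scorer: filler words score 0, everything else 1.
# A real system would score tokens with a small trained LM instead.
LOW_VALUE = {"i", "you", "a", "an", "the", "to", "of", "and", "that",
             "is", "would", "like", "please", "kindly", "really", "very"}

def compress_by_importance(prompt: str, keep_ratio: float = 0.5) -> str:
    tokens = prompt.split()
    budget = max(1, int(len(tokens) * keep_ratio))
    # Score each token, keep the highest-scoring ones within the budget,
    # then restore original order so the prompt still reads left to right.
    scored = [(0.0 if t.lower().strip(".,!?") in LOW_VALUE else 1.0, i)
              for i, t in enumerate(tokens)]
    kept = sorted(i for _, i in
                  sorted(scored, key=lambda s: (-s[0], s[1]))[:budget])
    return " ".join(tokens[i] for i in kept)
```

The `keep_ratio` parameter plays the role of the compression target: 0.5 keeps half the tokens, 0.2 keeps a fifth.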
Prompt Compression — Reduce Tokens, Preserve Meaning
Verbose Prompt
620 tokens
"I would like you to please carefully read the following customer support transcript that has been provided below and then, after you have thoroughly read and understood all of the content in the transcript, I would appreciate it if you could kindly generate a concise summary that highlights the most important points..."
Compressed Prompt
38 tokens
"Summarize the key points from the support transcript below."
Token reduction: 620 → 38 (94% smaller)
Aggressive compression can remove semantic nuance. Always benchmark accuracy before and after compression, especially for complex reasoning tasks.
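One way to operationalize that advice is a simple regression gate over a fixed eval set: run the same questions with and without compression, score each answer, and reject any compression configuration whose quality drop exceeds a budget. The scoring scale and 2-point threshold below are illustrative.

```python
def compression_acceptable(baseline_scores, compressed_scores,
                           max_drop: float = 0.02) -> bool:
    """Return True if mean eval quality drops by at most `max_drop`
    (e.g. 2 points on a 0..1 scale) under the compressed prompts."""
    base = sum(baseline_scores) / len(baseline_scores)
    comp = sum(compressed_scores) / len(compressed_scores)
    return base - comp <= max_drop
```

Running this gate in CI whenever the compression ratio or model changes catches quality regressions before they reach users.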
Real-World Example
A legal document analysis service retrieves an average of 15,000 tokens of relevant case law per query against their full context window of 128K tokens. After implementing LLMLingua compression with a 60% compression ratio, retrieved context averages 6,000 tokens—reducing per-query costs by 60% and cutting average query latency from 12 seconds to 4 seconds. Response quality on their evaluation set declined by only 1.8%, well within acceptable bounds. The cost savings funded development of two new features in the same quarter.
Common Mistakes
- ✕ Compressing context indiscriminately without evaluating quality impact—some compression introduces errors by removing key supporting details
- ✕ Compressing system prompts aggressively—system prompt instructions are typically dense and important; over-compressing them degrades reliability
- ✕ Applying the same compression ratio to all document types—technical and legal text require more conservative compression than general prose
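The last mistake is often fixed with a per-document-type keep ratio. The mapping below is a hypothetical configuration (ratios are the fraction of tokens retained, so denser text keeps more):

```python
# Fraction of tokens to keep per document type; values are illustrative.
KEEP_RATIOS = {
    "legal": 0.85,        # dense, citation-heavy: compress gently
    "technical": 0.75,
    "support_chat": 0.40,  # conversational filler compresses well
    "general": 0.50,
}

def keep_ratio(doc_type: str) -> float:
    # Default conservatively when the document type is unknown.
    return KEEP_RATIOS.get(doc_type, 0.75)
```

Each ratio should be set empirically by running the document type through the quality evaluation described above, not chosen by intuition.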
Related Terms
Context Window
A context window is the maximum amount of text (measured in tokens) that a language model can process in a single inference call, determining how much retrieved content, conversation history, and instructions can be included in a RAG prompt.
Prompt Engineering
Prompt engineering is the practice of designing and refining the text inputs given to AI language models to reliably produce accurate, useful, and well-formatted outputs for specific tasks.
Retrieval-Augmented Prompting
Retrieval-augmented prompting dynamically injects relevant documents or facts into the prompt at query time, grounding the LLM's response in current, specific knowledge rather than relying solely on its static pre-trained memory.
Token
A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.
Prompt Chaining
Prompt chaining connects multiple LLM calls sequentially where each step's output becomes the next step's input, enabling complex multi-stage tasks that exceed what any single prompt can accomplish reliably.