Contextual Compression
Definition
Contextual compression is a post-retrieval, pre-generation technique that refines the retrieved documents before they are passed to the LLM. Rather than passing entire retrieved chunks (which may contain relevant information alongside irrelevant sections), contextual compression extracts the specific passages or sentences most relevant to the query, or summarizes the retrieved content focused on the query's information need. This produces a compressed context that is more information-dense — every token contributes to answering the question, rather than the LLM having to locate the relevant needle in a haystack of retrieved text.
Why It Matters
Contextual compression improves RAG quality in two ways: it reduces context window usage (so more documents fit within the same token budget) and it improves answer quality (the LLM works with focused, relevant text rather than full retrieved documents containing tangential information). It is most valuable when retrieved chunks are long and only partially relevant, for example when the answer to a question sits in one paragraph of a multi-page document.
How It Works
Contextual compression is implemented through two approaches: extraction (using an LLM to identify and extract the specific sentences in the retrieved document that are relevant to the query) and summarization (using an LLM to summarize the retrieved document focused on what is relevant to the query). LangChain provides a ContextualCompressionRetriever that wraps any retriever and applies a compressor (LLMChainExtractor or LLMChainFilter) to the retrieved documents. The compression step adds latency (one LLM call per retrieved document) but reduces the context passed to the final generation step, potentially reducing final generation cost and improving quality.
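The extraction approach can be sketched without any framework. The snippet below is a minimal, self-contained stand-in: it keeps only the sentences that overlap with the query's content words, where a real compressor (such as LangChain's LLMChainExtractor) would make an LLM call per document. The function name, stopword list, and overlap heuristic are illustrative, not part of any library.

```python
import re

def compress_chunk(query: str, chunk: str, min_overlap: int = 2) -> str:
    """Extraction-style compression: keep only sentences sharing enough
    content words with the query. A real extractor would replace this
    heuristic with an LLM call per retrieved document."""
    stopwords = {"how", "do", "i", "my", "the", "a", "an", "to", "of", "is"}
    query_terms = set(re.findall(r"[a-z]+", query.lower())) - stopwords
    # Split the chunk into sentences on terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences
            if len(query_terms & set(re.findall(r"[a-z]+", s.lower()))) >= min_overlap]
    return " ".join(kept)

chunk = ("We offer several subscription tiers. "
         "To cancel your subscription, open Account Settings and select Cancel Plan. "
         "Our referral program offers credits for each new user you invite.")
print(compress_chunk("How do I cancel my subscription?", chunk))
# To cancel your subscription, open Account Settings and select Cancel Plan.
```

The same shape applies to the summarization variant: instead of filtering sentences, the per-document call would produce a query-focused summary of the chunk.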
Contextual Compression — Before vs After
Query
"How do I cancel my subscription?"
Retrieved chunk (500 tokens)
Welcome to our platform. We offer several subscription tiers...
Our team is available 24/7 for enterprise customers via email...
To cancel your subscription, navigate to Account Settings and select Cancel Plan.
You can also upgrade or downgrade at any time from the billing section...
Cancellation takes effect at the end of your current billing period...
For refund requests, contact support@example.com within 30 days...
Our referral program offers credits for each new user you invite...
Compressor: LLM extractor (keeps only relevant sentences)
Compressed result (80 tokens, 84% reduction)
To cancel your subscription, go to Account Settings and select Cancel Plan. Cancellation takes effect at the end of the billing period.
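The 84% figure is just the relative token reduction between the retrieved chunk and the compressed result:

```python
# Reduction math for the example above: 500 tokens in, 80 tokens out.
before, after = 500, 80
reduction = 100 * (before - after) / before
print(f"{reduction:.0f}% reduction")  # 84% reduction
```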
Real-World Example
A 99helpers customer's knowledge base contains product specification documents that are 2,000-3,000 tokens each. When a user asks about one specific feature, the retrieved spec document is passed in full as context — using 2,000+ tokens for information that is 95% irrelevant to the specific question. After implementing contextual compression that extracts the 2-3 relevant paragraphs from each spec document, average context token usage drops from 6,000 to 800 tokens per query. Final answer accuracy improves because the LLM focuses on the extracted relevant sections rather than searching through a large document.
Common Mistakes
- ✕ Applying compression to every retrieved document regardless of relevance — compression is most valuable for long, partially-relevant documents; short focused chunks do not benefit
- ✕ Using compression as a substitute for better retrieval — compression reduces noise in retrieved documents but does not fix retrieving the wrong documents in the first place
- ✕ Ignoring the latency cost of compression — one LLM call per retrieved document adds significant latency; evaluate whether the quality improvement justifies the cost
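One common mitigation for the latency cost above is to issue the per-document compression calls concurrently rather than one after another, so total latency stays close to a single call's latency. A minimal asyncio sketch, with a sleep standing in for the LLM call (the function names are illustrative):

```python
import asyncio

async def compress(doc: str) -> str:
    """Stand-in for one LLM compression call; a real call would hit an API."""
    await asyncio.sleep(0.1)  # simulated network latency
    return doc.upper()        # placeholder "compression"

async def compress_all(docs: list[str]) -> list[str]:
    # Launch all per-document calls concurrently; gather preserves order.
    return await asyncio.gather(*(compress(d) for d in docs))

docs = ["chunk one", "chunk two", "chunk three"]
results = asyncio.run(compress_all(docs))
print(results)
```

With three documents this takes roughly one call's latency instead of three, at the cost of issuing the requests in parallel (which may matter under provider rate limits).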
Related Terms
Document Chunking
Document chunking is the process of splitting large documents into smaller text segments before embedding and indexing for RAG, balancing chunk size to preserve context while staying within embedding model limits and enabling precise retrieval.
Context Window
A context window is the maximum amount of text (measured in tokens) that a language model can process in a single inference call, determining how much retrieved content, conversation history, and instructions can be included in a RAG prompt.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language model responses by first retrieving relevant documents from an external knowledge base and then using that retrieved content as context when generating an answer.
Reranking
Reranking is a second-stage retrieval step that takes an initial set of candidate documents returned by a fast retrieval method and reorders them using a more accurate but computationally expensive model to improve final result quality.