Contextual Compression
Definition
Contextual compression is a post-retrieval, pre-generation technique that refines the retrieved documents before they are passed to the LLM. Rather than passing entire retrieved chunks (which may contain relevant information alongside irrelevant sections), contextual compression extracts the specific passages or sentences most relevant to the query, or summarizes the retrieved content focused on the query's information need. This produces a compressed context that is more information-dense — every token contributes to answering the question, rather than the LLM having to locate the relevant needle in a haystack of retrieved text.
Why It Matters
Contextual compression improves RAG quality in two ways: it reduces context window usage (so more documents fit within the same token budget) and it improves answer quality (the LLM works with focused, relevant text rather than full retrieved documents containing tangential information). It is most valuable when retrieved chunks are long and only partially relevant, for example when the answer to a question sits in one paragraph of a multi-page document.
How It Works
Contextual compression is implemented through two approaches: extraction (using an LLM to identify and extract the specific sentences in the retrieved document that are relevant to the query) and summarization (using an LLM to summarize the retrieved document focused on what is relevant to the query). LangChain provides a ContextualCompressionRetriever that wraps any retriever and applies a compressor (LLMChainExtractor or LLMChainFilter) to the retrieved documents. The compression step adds latency (one LLM call per retrieved document) but reduces the context passed to the final generation step, potentially reducing final generation cost and improving quality.
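The extraction approach can be sketched without any framework. The snippet below is a minimal, self-contained stand-in: it keeps only the sentences that overlap with the query's content words, where a real compressor (such as LangChain's LLMChainExtractor) would make an LLM call per document. The function name, stopword list, and overlap heuristic are illustrative, not part of any library.

```python
import re

def compress_chunk(query: str, chunk: str, min_overlap: int = 2) -> str:
    """Extraction-style compression: keep only sentences sharing enough
    content words with the query. A real extractor would replace this
    heuristic with an LLM call per retrieved document."""
    stopwords = {"how", "do", "i", "my", "the", "a", "an", "to", "of", "is"}
    query_terms = set(re.findall(r"[a-z]+", query.lower())) - stopwords
    # Split the chunk into sentences on terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences
            if len(query_terms & set(re.findall(r"[a-z]+", s.lower()))) >= min_overlap]
    return " ".join(kept)

chunk = ("We offer several subscription tiers. "
         "To cancel your subscription, open Account Settings and select Cancel Plan. "
         "Our referral program offers credits for each new user you invite.")
print(compress_chunk("How do I cancel my subscription?", chunk))
# To cancel your subscription, open Account Settings and select Cancel Plan.
```

The same shape applies to the summarization variant: instead of filtering sentences, the per-document call would produce a query-focused summary of the chunk.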
Contextual Compression — Before vs After
Query
"How do I cancel my subscription?"
Retrieved chunk (500 tokens)
Welcome to our platform. We offer several subscription tiers...
Our team is available 24/7 for enterprise customers via email...
To cancel your subscription, navigate to Account Settings and select Cancel Plan.
You can also upgrade or downgrade at any time from the billing section...
Cancellation takes effect at the end of your current billing period...
For refund requests, contact support@example.com within 30 days...
Our referral program offers credits for each new user you invite...
Compressor: LLM extractor (keeps only relevant sentences)
Compressed result (80 tokens, 84% reduction)
To cancel your subscription, go to Account Settings and select Cancel Plan. Cancellation takes effect at the end of the billing period.
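The 84% figure is just the relative token reduction between the retrieved chunk and the compressed result:

```python
# Reduction math for the example above: 500 tokens in, 80 tokens out.
before, after = 500, 80
reduction = 100 * (before - after) / before
print(f"{reduction:.0f}% reduction")  # 84% reduction
```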
Real-World Example
A 99helpers customer's knowledge base contains product specification documents that are 2,000-3,000 tokens each. When a user asks about one specific feature, the retrieved spec document is passed in full as context — using 2,000+ tokens for information that is 95% irrelevant to the specific question. After implementing contextual compression that extracts the 2-3 relevant paragraphs from each spec document, average context token usage drops from 6,000 to 800 tokens per query. Final answer accuracy improves because the LLM focuses on the extracted relevant sections rather than searching through a large document.
Common Mistakes
- ✕ Applying compression to every retrieved document regardless of relevance — compression is most valuable for long, partially-relevant documents; short focused chunks do not benefit
- ✕ Using compression as a substitute for better retrieval — compression reduces noise in retrieved documents but does not fix retrieving the wrong documents in the first place
- ✕ Ignoring the latency cost of compression — one LLM call per retrieved document adds significant latency; evaluate whether the quality improvement justifies the cost
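One common mitigation for the latency cost above is to issue the per-document compression calls concurrently rather than one after another, so total latency stays close to a single call's latency. A minimal asyncio sketch, with a sleep standing in for the LLM call (the function names are illustrative):

```python
import asyncio

async def compress(doc: str) -> str:
    """Stand-in for one LLM compression call; a real call would hit an API."""
    await asyncio.sleep(0.1)  # simulated network latency
    return doc.upper()        # placeholder "compression"

async def compress_all(docs: list[str]) -> list[str]:
    # Launch all per-document calls concurrently; gather preserves order.
    return await asyncio.gather(*(compress(d) for d in docs))

docs = ["chunk one", "chunk two", "chunk three"]
results = asyncio.run(compress_all(docs))
print(results)
```

With three documents this takes roughly one call's latency instead of three, at the cost of issuing the requests in parallel (which may matter under provider rate limits).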
Related Terms
Document Chunking
Document chunking is the process of splitting large documents into smaller text segments before embedding and indexing for RAG, balancing chunk size to preserve context while staying within embedding model limits and enabling precise retrieval.
Context Window
A context window is the maximum amount of text (measured in tokens) that a language model can process in a single inference call, determining how much retrieved content, conversation history, and instructions can be included in a RAG prompt.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language model responses by first retrieving relevant documents from an external knowledge base and then using that retrieved content as context when generating an answer.
Reranking
Reranking is a second-stage retrieval step that takes an initial set of candidate documents returned by a fast retrieval method and reorders them using a more accurate but computationally expensive model to improve final result quality.