Retrieval Pipeline
Definition
The retrieval pipeline executes at query time, taking a user's question as input and producing a set of relevant document chunks as output for the generation step. Core stages are: (1) query processing—cleaning, language detection, expansion, or rewriting; (2) query embedding—converting the query to a vector using the same embedding model used during indexing; (3) vector search—querying the vector database for approximate nearest neighbors; (4) optional metadata filtering—restricting results by category, date, source, or other attributes; (5) optional reranking—using a cross-encoder to reorder results by relevance; (6) context assembly—combining retrieved chunks into a prompt context respecting the LLM's token limit. Each stage can be independently optimized.
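The six stages above can be sketched as one composable function. This is a minimal illustration, not a production implementation: `embed_fn`, `search_fn`, and `rerank_fn` are hypothetical stand-ins for real embedding, vector-database, and reranker clients, and the 4-characters-per-token ratio is a rough heuristic.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float
    metadata: dict

def retrieve(query, embed_fn, search_fn, rerank_fn=None,
             metadata_filter=None, top_k=5, token_limit=8000):
    """Sketch of the six retrieval stages as one function.
    embed_fn / search_fn / rerank_fn are placeholders for real clients."""
    # (1) query processing: minimal normalization
    clean = query.strip().lower()
    # (2) query embedding with the same model used at index time
    vector = embed_fn(clean)
    # (3) approximate nearest-neighbor search against the vector DB
    candidates = search_fn(vector)
    # (4) optional metadata filtering on structured attributes
    if metadata_filter:
        candidates = [c for c in candidates if metadata_filter(c.metadata)]
    # (5) optional cross-encoder reranking by query/document relevance
    if rerank_fn:
        candidates = sorted(candidates,
                            key=lambda c: rerank_fn(clean, c.text),
                            reverse=True)
    # (6) context assembly under a rough token budget (~4 chars/token)
    context, used = [], 0
    for c in candidates[:top_k]:
        cost = len(c.text) // 4
        if used + cost > token_limit:
            break
        context.append(c.text)
        used += cost
    return context
```

Because each stage is an injected function, any one of them can be swapped out or tuned independently, which is exactly the property the definition highlights.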
Why It Matters
The retrieval pipeline is the quality bottleneck of most RAG systems. No matter how powerful the generation model, it cannot produce accurate answers if the retrieval pipeline fails to surface the right documents. Monitoring retrieval pipeline metrics—retrieval latency, cache hit rate, recall@K, reranker lift—is essential for maintaining chatbot quality at scale. For 99helpers customers with large, diverse knowledge bases, retrieval pipeline optimization (tuning K, adding reranking, implementing metadata filters) often produces larger quality improvements than switching to a more expensive LLM.
How It Works
A retrieval pipeline request flow: user sends query 'how do I add a team member?' → query preprocessor normalizes case, removes stop words → embedding API converts query to 1536-dim vector → Pinecone similarity search returns top-20 candidates with scores → metadata filter keeps only docs with category='team-management' → cross-encoder reranker scores each of 20 candidates against query → top-5 reranked results passed to context assembler → assembler formats chunks with source citations respecting 8K token limit → formatted context sent to LLM. Total latency: ~300ms (embedding 50ms + vector search 80ms + reranking 150ms + assembly 20ms).
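The preprocessing step in the flow above (normalize case, remove stop words) might look like the following sketch. The stop-word set here is an illustrative subset, not an exhaustive list.

```python
import re

# Illustrative subset of English stop words; real pipelines use a fuller list.
STOP_WORDS = {"how", "do", "i", "a", "the", "to", "my"}

def preprocess_query(query: str) -> str:
    """Normalize a raw user query before embedding:
    lowercase, strip punctuation, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(kept)

preprocess_query("How do I add a team member?")  # → "add team member"
```

Whether stop-word removal helps depends on the embedding model; many modern embedding models handle full sentences well, so this step is worth A/B testing rather than assuming.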
Retrieval Pipeline — Stages from Query to Context
1. Raw Query: "how do i cancel my subscription"
2. Query Preprocessing: lowercase, expand: cancel → cancel, end, stop
3. Embedding: [0.21, -0.83, 0.45...] via text-embedding-3-small
4. Vector Search: ANN search, top-50 candidates from index
5. Metadata Filtering: WHERE org_id = 42 AND language = 'en'
6. Reranking: cross-encoder rescores top-50 → select top-5
7. Context Selection: top-5 chunks joined → context ready for LLM
Total pipeline latency: ~80ms (Raw 0ms, Query preprocessing 2ms, Embedding 20ms, Vector search 8ms, Metadata filtering 3ms, Reranking 45ms, Context assembly 2ms). Reranking dominates the budget.
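The final context-selection stage can be sketched as a greedy packer: keep appending ranked chunks with source citations until the token budget is exhausted. The chars-per-token ratio and citation format here are illustrative assumptions.

```python
def assemble_context(chunks, token_limit=8000, chars_per_token=4):
    """Join ranked chunks into one prompt context with source citations,
    stopping before the (approximate) token limit.
    chunks: list of (text, source) pairs, already ordered by relevance."""
    parts, used = [], 0
    for text, source in chunks:
        # rough token estimate; real pipelines would use the LLM's tokenizer
        cost = (len(text) + len(source)) // chars_per_token + 1
        if used + cost > token_limit:
            break
        parts.append(f"[{source}] {text}")
        used += cost
    return "\n\n".join(parts)
```

Greedy packing is simple but order-sensitive: a long low-relevance chunk can crowd out several shorter, more relevant ones, which is one reason reranking runs before assembly.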
Real-World Example
A 99helpers chatbot serves 500 concurrent users. The retrieval pipeline initially retrieves the top-20 candidates, reranks them with a cross-encoder, and returns the top-5 to the LLM. Profiling shows reranking takes 300ms on average, making total retrieval 450ms. Switching from an API-hosted cross-encoder to a smaller, locally hosted reranker (ms-marco-MiniLM-L-6-v2) and reducing candidates from 20 to 10 cuts reranking time to 80ms, bringing total retrieval to roughly 230ms (the remaining ~150ms of non-reranking work is unchanged), with only a 2% drop in NDCG@5—an acceptable quality/latency tradeoff.
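The reranking step in this example can be sketched as follows. In a real deployment the scoring function might be a cross-encoder such as `CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')` from the sentence-transformers library; here a toy token-overlap scorer stands in so the sketch runs without downloading a model.

```python
def rerank(query, candidates, top_n=5, score_fn=None):
    """Second-stage reranking: rescore first-stage candidates with a
    more accurate scorer and keep the best top_n.
    score_fn stands in for a cross-encoder's query/document scorer;
    the default is a toy lexical-overlap heuristic, NOT a real model."""
    if score_fn is None:
        def score_fn(q, doc):
            qs, ds = set(q.lower().split()), set(doc.lower().split())
            return len(qs & ds) / max(len(qs), 1)
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]
```

The structural point of the example survives the toy scorer: halving the candidate count halves the number of query/document pairs the expensive model must score, which is where the latency win comes from.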
Common Mistakes
- ✕ Skipping query preprocessing—queries with typos, uppercase, or trailing whitespace produce suboptimal embeddings and retrieval.
- ✕ Using the same K value for all query types—complex queries may need K=20 while simple factual queries are satisfied with K=3.
- ✕ Not instrumenting individual pipeline stages—without per-stage latency and quality metrics, optimization is blind.
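Per-stage instrumentation, as the last point recommends, can be as simple as a timing context manager; the stage names and the commented-out calls below are placeholders.

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name, timings):
    """Record the wall-clock latency of one pipeline stage (in ms)
    into `timings`, for export to whatever metrics backend you use."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

# usage sketch with placeholder stage bodies
timings = {}
with stage_timer("embedding", timings):
    pass  # embed_fn(query)
with stage_timer("vector_search", timings):
    pass  # search_fn(vector)
```

With every stage wrapped this way, a latency regression points directly at the stage that caused it instead of showing up only in the end-to-end number.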
Related Terms
RAG Pipeline
A RAG pipeline is the end-to-end sequence of components—ingestion, chunking, embedding, storage, retrieval, and generation—that transforms raw documents into AI-generated answers grounded in a knowledge base.
Reranking
Reranking is a second-stage retrieval step that takes an initial set of candidate documents returned by a fast retrieval method and reorders them using a more accurate but computationally expensive model to improve final result quality.
Query Rewriting
Query rewriting is a technique that transforms a user's original query into an improved version — clearer, more complete, or better suited for retrieval — using an LLM to improve recall and relevance before searching the knowledge base.
Metadata Filtering
Metadata filtering restricts vector search to a subset of documents based on structured attributes — such as category, date, language, or source — enabling more precise retrieval by pre-filtering the candidate pool before similarity search.
Vector Database
A vector database is a purpose-built data store optimized for storing, indexing, and querying high-dimensional numerical vectors (embeddings), enabling fast similarity search across large collections of embedded documents.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →