Retrieval-Augmented Generation (RAG)

Retrieval Pipeline

Definition

The retrieval pipeline executes at query time, taking a user's question as input and producing a set of relevant document chunks as output for the generation step. Core stages are: (1) query processing—cleaning, language detection, expansion, or rewriting; (2) query embedding—converting the query to a vector using the same embedding model used during indexing; (3) vector search—querying the vector database for approximate nearest neighbors; (4) optional metadata filtering—restricting results by category, date, source, or other attributes; (5) optional reranking—using a cross-encoder to reorder results by relevance; (6) context assembly—combining retrieved chunks into a prompt context respecting the LLM's token limit. Each stage can be independently optimized.
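
The six stages above can be sketched as a simple composition. This is an illustrative skeleton, not any vendor's API: the stage bodies (placeholder embedding, stubbed ANN search, score-based "reranking") are stand-ins for real components.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float
    metadata: dict

def preprocess(query: str) -> str:
    # Stage 1: normalize the raw query.
    return query.strip().lower()

def embed(query: str) -> list:
    # Stage 2: stand-in for a call to the same embedding model used at indexing time.
    return [float(ord(c)) for c in query[:8]]  # hypothetical placeholder vector

def vector_search(vec, k: int) -> list:
    # Stage 3: stand-in for an ANN query against the vector database.
    return [Chunk(f"doc {i}", 1.0 - i * 0.1, {"category": "billing"}) for i in range(k)]

def metadata_filter(chunks, **attrs) -> list:
    # Stage 4: keep only chunks whose metadata matches every given attribute.
    return [c for c in chunks if all(c.metadata.get(k) == v for k, v in attrs.items())]

def rerank(query, chunks, top_n: int) -> list:
    # Stage 5: a real system would rescore with a cross-encoder; here we just re-sort.
    return sorted(chunks, key=lambda c: c.score, reverse=True)[:top_n]

def assemble(chunks, token_budget: int) -> str:
    # Stage 6: join chunks until the (roughly estimated) token budget is exhausted.
    out, used = [], 0
    for c in chunks:
        cost = len(c.text.split())
        if used + cost > token_budget:
            break
        out.append(c.text)
        used += cost
    return "\n\n".join(out)

def retrieve(query: str) -> str:
    q = preprocess(query)
    candidates = vector_search(embed(q), k=20)
    candidates = metadata_filter(candidates, category="billing")
    top = rerank(q, candidates, top_n=5)
    return assemble(top, token_budget=8000)
```

Because each stage is a separate function with a plain input/output contract, each can be swapped or tuned independently, which is exactly why the stages can be "independently optimized."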

Why It Matters

The retrieval pipeline is the quality bottleneck of most RAG systems. No matter how powerful the generation model, it cannot produce accurate answers if the retrieval pipeline fails to surface the right documents. Monitoring retrieval pipeline metrics—retrieval latency, cache hit rate, recall@K, reranker lift—is essential for maintaining chatbot quality at scale. For 99helpers customers with large, diverse knowledge bases, retrieval pipeline optimization (tuning K, adding reranking, implementing metadata filters) often produces larger quality improvements than switching to a more expensive LLM.

How It Works

A typical request flow through the retrieval pipeline: user sends query 'how do I add a team member?' → query preprocessor normalizes case, removes stop words → embedding API converts query to 1536-dim vector → Pinecone similarity search returns top-20 candidates with scores → metadata filter keeps only docs with category='team-management' → cross-encoder reranker scores each of 20 candidates against query → top-5 reranked results passed to context assembler → assembler formats chunks with source citations respecting 8K token limit → formatted context sent to LLM. Total latency: ~300ms (embedding 50ms + vector search 80ms + reranking 150ms + assembly 20ms).
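
The metadata-filtering step in this flow can be expressed as a filter object passed alongside the vector query. The sketch below builds a Pinecone-style filter using its MongoDB-like operators ($eq, $gte, $in); the field names and the helper itself are illustrative, not part of any SDK.

```python
def build_filter(category=None, after_date=None, sources=None):
    """Build a Pinecone-style metadata filter dict (MongoDB-like operators)."""
    clauses = {}
    if category:
        clauses["category"] = {"$eq": category}
    if after_date:
        clauses["updated_at"] = {"$gte": after_date}
    if sources:
        clauses["source"] = {"$in": sources}
    return clauses

f = build_filter(category="team-management")
# A real call would then look roughly like:
#   index.query(vector=query_vec, top_k=20, filter=f)
```

Applying the filter inside the vector query (rather than post-filtering results in application code) keeps the top-K slots filled with eligible documents.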

Retrieval Pipeline — Stages from Query to Context

1. Raw Query: "how do i cancel my subscription" (0ms)
2. Query Preprocessing: lowercase, expand: cancel → cancel, end, stop (2ms)
3. Embedding: [0.21, -0.83, 0.45...] via text-embedding-3-small (20ms)
4. Vector Search: ANN search, top-50 candidates from index (8ms)
5. Metadata Filtering: WHERE org_id = 42 AND language = 'en' (3ms)
6. Reranking: cross-encoder rescores top-50 → select top-5 (45ms)
7. Context Selection: top-5 chunks joined → context ready for LLM (2ms)

Total pipeline latency: 0 + 2 + 20 + 8 + 3 + 45 + 2 = ~80ms
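
Stage 2 of this walkthrough (lowercasing plus synonym expansion) might look like the sketch below. The synonym table and stop-word list are illustrative placeholders; a production system would use a curated, domain-specific lexicon.

```python
import re

# Hypothetical synonym table keyed on normalized tokens.
SYNONYMS = {"cancel": ["cancel", "end", "stop"]}

# Minimal illustrative stop-word list.
STOP_WORDS = {"how", "do", "i", "my", "a", "the"}

def preprocess(query: str) -> str:
    """Lowercase, strip punctuation, drop stop words, expand synonyms."""
    tokens = re.findall(r"[a-z0-9']+", query.lower().strip())
    expanded = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        expanded.extend(SYNONYMS.get(tok, [tok]))
    return " ".join(expanded)

print(preprocess("How do I cancel my subscription?"))
# → "cancel end stop subscription"
```

The expanded query embeds closer to documents that say "end your subscription" or "stop billing," which is the point of expansion: recovering relevant chunks whose wording differs from the user's.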

Real-World Example

A 99helpers chatbot serves 500 concurrent users. The retrieval pipeline initially retrieves top-20 candidates, reranks with a cross-encoder, and returns top-5 to the LLM. Profiling shows reranking takes 300ms on average, making total retrieval 450ms. By switching from an API-hosted cross-encoder to a smaller locally hosted reranker (ms-marco-MiniLM-L-6-v2) and reducing candidates from 20 to 10, reranking time drops to 80ms, total retrieval to 200ms, with only a 2% drop in NDCG@5—an acceptable quality/latency tradeoff.
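
A quality drop like the 2% in NDCG@5 above is measured offline against labeled queries. A minimal NDCG@k implementation follows; the graded relevance labels for the two rerankers are illustrative, chosen to show roughly that size of gap.

```python
import math

def dcg_at_k(relevances, k):
    # Standard DCG: rel_i / log2(position + 1), positions starting at 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance (0-3) of the top-5 results from each reranker (illustrative).
api_reranker   = [3, 3, 2, 1, 0]   # ideal ordering
local_reranker = [3, 2, 3, 1, 0]   # swaps positions 2 and 3

print(round(ndcg_at_k(api_reranker, 5), 3))    # → 1.0
print(round(ndcg_at_k(local_reranker, 5), 3))  # → 0.979, about a 2% drop
```

Averaging NDCG@5 over a held-out query set before and after a reranker swap is what turns "acceptable quality/latency tradeoff" from a guess into a measured decision.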

Common Mistakes

  • Skipping query preprocessing—queries with typos, uppercase, or trailing whitespace produce suboptimal embeddings and retrieval.
  • Using the same K value for all query types—complex queries may need K=20 while simple factual queries are satisfied with K=3.
  • Not instrumenting individual pipeline stages—without per-stage latency and quality metrics, optimization is blind.
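
The third mistake is cheap to fix: a small context manager can record wall-clock time per stage. This is a sketch; the stage names and the `time.sleep` placeholders stand in for real pipeline work.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Record elapsed wall-clock time (ms) for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

# Usage inside a retrieval call (sleeps are placeholders for real work):
with stage("embedding"):
    time.sleep(0.01)
with stage("vector_search"):
    time.sleep(0.005)

total = sum(timings.values())
print({k: round(v, 1) for k, v in timings.items()}, f"total={total:.1f}ms")
```

Emitting these per-stage numbers to a metrics backend gives exactly the breakdown used in the profiling story above: without them, you cannot tell whether latency is going to embedding, search, or reranking.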
