Multi-Query Retrieval
Definition
Multi-query retrieval is a query expansion strategy that uses an LLM to generate multiple reformulations of the user's original question, performs separate vector searches for each formulation, and merges the resulting document sets (typically using union with deduplication) to produce a richer candidate pool. The motivation is that any single query formulation captures only one way of expressing the information need — different phrasings retrieve different (but potentially equally relevant) documents. By retrieving across multiple phrasings, the system achieves higher recall at the cost of retrieving more candidates (some of which may be less relevant).
Why It Matters
Multi-query retrieval is particularly valuable for complex, ambiguous, or underspecified queries where a single formulation may miss relevant documents. Users asking about nuanced topics often benefit from having their question rephrased in both technical and lay language, in both general and specific form, and from both question and statement perspectives. The merged document pool provides the LLM with a broader information base for answering complex questions. Multi-query is especially effective when combined with reranking — the larger candidate pool from multiple queries is reranked to select the most relevant documents.
How It Works
Multi-query retrieval is implemented by prompting an LLM to generate N alternative formulations (typically 3-5) of the user's query. Example prompt: 'Generate 4 alternative versions of the following question that capture the same information need from different angles: {question}'. Each alternative is used for a separate vector search (and optionally keyword search). Results are merged with deduplication — if the same document chunk appears in multiple query results, it is included once. The merged set is either used directly (top-k from each query) or passed to a reranker to select the final context.
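The pipeline above can be sketched in a few lines. This is a minimal, runnable illustration, not a production implementation: `generate_reformulations` and `vector_search` are stand-in stubs (a real system would call an LLM with a prompt like the one above and query an actual vector index), and all names here are hypothetical.

```python
from collections import OrderedDict

def generate_reformulations(question, n=4):
    """Stub for the LLM call that would return n alternative phrasings.

    A real system would send the prompt shown above to an LLM; here we
    fabricate variants so the pipeline runs end to end.
    """
    return [f"{question} (variant {i})" for i in range(1, n + 1)]

def vector_search(query, k=3):
    """Stub vector search returning (doc_id, score) pairs.

    A real implementation would embed the query and search an index;
    here fake doc ids are derived from the query text.
    """
    base = abs(hash(query)) % 10
    return [((base + i) % 10, 1.0 - 0.1 * i) for i in range(k)]

def multi_query_retrieve(question, n_variants=4, k=3):
    """Run one search per formulation and merge with deduplication."""
    queries = [question] + generate_reformulations(question, n_variants)
    merged = OrderedDict()  # doc_id -> best score seen for that doc
    for q in queries:
        for doc_id, score in vector_search(q, k):
            if doc_id not in merged or score > merged[doc_id]:
                merged[doc_id] = score
    # Sort the deduplicated pool by best score; in practice a reranker
    # would take this pool and select the final context instead.
    return sorted(merged.items(), key=lambda kv: -kv[1])

pool = multi_query_retrieve("why is my chatbot slow?")
```

Each document chunk appears at most once in `pool` even if several reformulations retrieved it, which is the dedup-before-rerank behavior described above.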
[Diagram] Multi-Query Retrieval — Expanded Recall via Sub-queries: LLM query expansion generates three sub-queries ('chatbot response latency causes', 'slow AI inference optimization', 'vector search performance tuning'); their deduplicated union yields 6 unique docs versus 3 from a single query.
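The deduplicated union shown in the diagram can be reproduced with plain sets. The per-query result lists below are hypothetical and chosen to overlap, so the union is larger than any single result list but smaller than their sum:

```python
# Hypothetical hits for the three sub-queries from the diagram;
# d2, d3, and d5 are retrieved by more than one sub-query.
results = {
    "chatbot response latency causes":  ["d1", "d2", "d3"],
    "slow AI inference optimization":   ["d2", "d4", "d5"],
    "vector search performance tuning": ["d3", "d5", "d6"],
}

# Set union deduplicates automatically: 9 total hits collapse to 6.
unique_docs = set().union(*results.values())
print(len(unique_docs))  # → 6 unique docs vs 3 from any single query
```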
Real-World Example
A 99helpers customer implements multi-query retrieval for their complex B2B SaaS product. For the query 'Can my team use different account permissions?', multi-query generates: 'team member role-based access', 'user permission levels', 'admin and viewer account types', and 'sharing settings multiple users'. The four queries collectively retrieve 18 unique relevant chunks versus 4 for the original query. The reranker selects the 5 most relevant for the LLM context. Answer completeness for multi-part permission questions improves significantly, and customer queries requiring escalation on this topic decrease by 40%.
Common Mistakes
- ✕Generating too many query variations — 8+ queries create noise and latency without proportional recall improvement; 3-5 variations is typically optimal
- ✕Not deduplicating results before passing to the LLM — duplicate chunks waste context window space and may cause the LLM to over-weight repeated information
- ✕Applying multi-query to simple, specific questions — 'what is your refund policy?' does not benefit from multiple reformulations; apply multi-query selectively to complex queries
Related Terms
Query Expansion
Query expansion is a retrieval technique that augments the original user query with related terms, synonyms, or alternative phrasings before search, improving recall by retrieving relevant documents that would not match the original query vocabulary.
Query Rewriting
Query rewriting is a technique that transforms a user's original query into an improved version — clearer, more complete, or better suited for retrieval — using an LLM to improve recall and relevance before searching the knowledge base.
Hypothetical Document Embedding
Hypothetical Document Embedding (HyDE) is a RAG technique that improves retrieval by having an LLM generate a hypothetical document that would answer the user's query, then using that document's embedding rather than the query embedding for similarity search.
Reranking
Reranking is a second-stage retrieval step that takes an initial set of candidate documents returned by a fast retrieval method and reorders them using a more accurate but computationally expensive model to improve final result quality.
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language model responses by first retrieving relevant documents from an external knowledge base and then using that retrieved content as context when generating an answer.