Retrieval Recall
Definition
Retrieval recall is a core evaluation metric for RAG systems that quantifies how completely a retriever captures relevant documents. Formally, recall is the number of relevant documents retrieved divided by the total number of documents in the corpus that are relevant to the query. A recall of 1.0 means every relevant document was found; 0.5 means half were missed. In RAG contexts, low recall directly causes LLM failures: if the answer-containing document is never retrieved, no amount of generation sophistication can compensate. Recall is typically measured alongside precision, and the two metrics are often in tension; increasing retrieval breadth improves recall but lowers precision.
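The definition above can be sketched as a few lines of Python. The document IDs are illustrative placeholders, not from any real corpus:

```python
def retrieval_recall(retrieved_ids, relevant_ids):
    """Fraction of ground-truth relevant documents that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant & set(retrieved_ids))
    return hits / len(relevant)

# 2 of the 3 relevant documents appear in the retrieved set -> recall = 2/3
print(retrieval_recall(["d1", "d2", "d9"], ["d1", "d2", "d5"]))
```

Note that order does not matter for recall; a relevant document ranked last in the retrieved set counts the same as one ranked first.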
Why It Matters
Retrieval recall determines whether your RAG system even has a chance of answering questions correctly. Missing relevant documents at retrieval time is an unrecoverable error—the generation step cannot invent information it was never given. Teams building production RAG systems track recall to identify gaps in their retrieval strategy, whether that means better chunking, improved embeddings, or hybrid search combining dense and sparse methods. High recall is especially critical for compliance and customer support use cases where missing a single relevant policy document could lead to incorrect guidance.
How It Works
To measure recall, you need a ground-truth dataset pairing queries with their relevant documents. The retriever runs each query and returns its top-K results. For each query, you count how many ground-truth relevant documents appear in the retrieved set and divide by the total relevant count. Tools like RAGAS automate this evaluation. Improving recall typically involves expanding K (retrieve more candidates), switching to hybrid retrieval, refining chunking so relevant content isn't fragmented, or adding metadata filters that narrow the search space without excluding relevant content.
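The measurement loop described above can be sketched as follows. The `retriever(query, k)` interface and the toy index are assumptions for illustration, not a real API:

```python
def mean_recall_at_k(retriever, eval_set, k=5):
    """Average recall@K over a ground-truth evaluation set.

    retriever(query, k) -> list of document IDs (hypothetical interface)
    eval_set: list of (query, set_of_relevant_ids) pairs
    """
    scores = []
    for query, relevant in eval_set:
        retrieved = set(retriever(query, k))
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores)

# Toy retriever over a tiny hypothetical index (illustration only).
def toy_retriever(query, k):
    index = {
        "reset password": ["help_article", "faq", "billing_doc"],
        "cancel plan": ["billing_doc", "help_article"],
    }
    return index.get(query, [])[:k]

eval_set = [
    ("reset password", {"help_article", "faq", "troubleshooting"}),
    ("cancel plan", {"billing_doc"}),
]
print(mean_recall_at_k(toy_retriever, eval_set, k=3))  # (2/3 + 1) / 2
```

Averaging per-query recall (rather than pooling hits across queries) keeps rare-but-important queries from being drowned out by common ones, which connects to the sampling pitfall noted under Common Mistakes.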
[Figure: Retrieval Recall — Corpus Coverage. Recall calculation: 9 retrieved relevant ÷ 12 total relevant = 0.75 recall, against all 12 relevant documents in the corpus. Companion panel, Recall vs Precision — increasing K: higher K improves recall but lowers precision; retrieving more candidates also pulls in more irrelevant ones.]
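The tradeoff in the figure can be made concrete by computing precision@K and recall@K at several cutoffs over one ranked result list. The ranking and ground truth below are hypothetical:

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall over the top-K of a ranked result list."""
    relevant = set(relevant_ids)
    hits = len(relevant & set(ranked_ids[:k]))
    return hits / k, hits / len(relevant)

ranked = ["d3", "d1", "d7", "d8", "d2", "d4"]  # hypothetical ranked results
relevant = ["d3", "d1", "d2"]                  # hypothetical ground truth
for k in (2, 4, 6):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"K={k}: precision={p:.2f} recall={r:.2f}")
```

Moving from K=2 to K=6 here lifts recall from 0.67 to 1.00 while precision falls from 1.00 to 0.50: exactly the pattern the figure describes.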
Real-World Example
A 99helpers customer asks 'How do I reset my account password?' The ground truth labels three documents as relevant: a help center article, a FAQ entry, and a troubleshooting guide. If the retriever returns the help center article and FAQ entry but misses the troubleshooting guide, retrieval recall is 2/3 = 0.67. By switching to hybrid retrieval combining BM25 and dense embeddings, the team retrieves all three documents, achieving recall of 1.0 and eliminating answer gaps.
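One common way to fuse BM25 and dense results, as in the example above, is reciprocal rank fusion (RRF). This is a sketch of that fusion step only, with hypothetical ranked lists standing in for real retriever output:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists into one, scoring each doc by sum of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["faq", "help_article", "billing_doc"]        # keyword retriever
dense_results = ["help_article", "troubleshooting", "faq"]   # embedding retriever
print(reciprocal_rank_fusion([bm25_results, dense_results])[:3])
```

In this toy run the fused top 3 contains all three relevant documents, including the troubleshooting guide that only the dense retriever surfaced, which is the mechanism by which hybrid retrieval lifts recall to 1.0 in the example.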
Common Mistakes
- ✕ Optimizing only for recall without monitoring precision leads to noisy context that confuses the LLM with irrelevant content.
- ✕ Using K=5 as a fixed number without testing whether relevant documents fall outside the top 5 for difficult queries.
- ✕ Evaluating recall on an unrepresentative sample—common queries may have high recall while rare but important queries do not.
Related Terms
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.
RAG Evaluation
RAG evaluation is the systematic measurement of a RAG system's quality across multiple dimensions — including retrieval accuracy, answer faithfulness, relevance, and completeness — to identify weaknesses and guide improvement.
Hybrid Retrieval
Hybrid retrieval combines dense (semantic) and sparse (keyword) search methods to leverage the strengths of both, using a fusion step to merge their results into a single ranked list for better overall retrieval quality.
Mean Reciprocal Rank (MRR)
Mean Reciprocal Rank (MRR) is a retrieval evaluation metric that measures how highly the first relevant document is ranked, averaged across queries. It rewards systems that place the most relevant result near the top of the list.
Faithfulness
Faithfulness is a RAG evaluation metric that measures whether the information in a generated answer is fully supported by the retrieved context, quantifying how well the model avoids hallucination when given source documents.