Mean Reciprocal Rank (MRR)
Definition
Mean Reciprocal Rank evaluates the ranking quality of a retrieval system. For a single query, the reciprocal rank is 1/r, where r is the position of the first relevant document in the ranked result list. If the first relevant document is ranked first, the reciprocal rank is 1.0; if it is ranked second, 0.5; if ranked third, 0.33, and so on. MRR averages this score across all queries in an evaluation set. MRR is particularly useful for RAG systems where the top-retrieved document is given the most weight by the language model—if the most relevant document is buried at position 5, the LLM may produce a suboptimal response even if the document is present.
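Written as a formula, with Q the evaluation query set and rank_i the position of the first relevant document for query i (the term is taken as 0 when no relevant document is retrieved):

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$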
Why It Matters
MRR captures whether your retrieval system surfaces the most relevant content prominently. In RAG pipelines, LLMs pay more attention to content appearing early in the context, so a high MRR generally translates into better answer quality. Teams building 99helpers chatbots use MRR alongside recall and precision to tune their reranking strategies. A reranker that significantly boosts MRR (moving the first relevant result from position 4 to position 1, say) typically produces measurable improvements in response accuracy even with the same underlying retrieval corpus.
How It Works
To compute MRR, run an evaluation set of queries through the retriever. For each query, scan the ranked result list to find the first relevant document and record its reciprocal rank (1/position). Average these scores across all queries. An MRR of 1.0 means every query's first relevant document was ranked first; 0.5 means the first relevant document appears, on average, around position 2. Improving MRR typically involves adding a cross-encoder reranker, upgrading the embedding model, or rewriting queries so that query representations align better with document representations.
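The computation fits in a few lines. A minimal Python sketch, assuming each query's results arrive as a ranked list of document IDs plus a set of known-relevant IDs (these input shapes and names are illustrative, not a fixed API):

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/position of the first relevant document, or 0.0 if none appears."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results):
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not results:
        raise ValueError("need at least one query to average over")
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in results) / len(results)
```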
Mean Reciprocal Rank (MRR) — Worked Example
| Query | First relevant result at rank | Reciprocal rank |
| --- | --- | --- |
| Q1: "How to reset password?" | 1 | 1.00 |
| Q2: "Billing invoice download" | 3 | 0.33 |
| Q3: "Cancel my subscription" | 2 | 0.50 |
MRR Calculation
RR(Q1) = 1 / 1 = 1.00
RR(Q2) = 1 / 3 ≈ 0.33
RR(Q3) = 1 / 2 = 0.50
MRR = (1.00 + 0.33 + 0.50) / 3 = 1.83 / 3 = 0.61
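Feeding the three queries above into the sketch from How It Works reproduces the same figure (the document IDs are hypothetical placeholders):

```python
# Each pair mirrors one row of the worked example above.
results = [
    (["d1", "d2", "d3"], {"d1"}),  # Q1: first relevant at rank 1 -> RR 1.00
    (["d4", "d5", "d6"], {"d6"}),  # Q2: first relevant at rank 3 -> RR 0.33
    (["d7", "d8", "d9"], {"d8"}),  # Q3: first relevant at rank 2 -> RR 0.50
]

print(round(mean_reciprocal_rank(results), 2))  # 0.61
```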
As a rough guide to interpreting MRR scores:

| MRR | Interpretation |
| --- | --- |
| 0.8 – 1.0 | Excellent |
| 0.6 – 0.8 | Good |
| < 0.6 | Needs work |
Real-World Example
A 99helpers knowledge base retriever returns 10 results per query. For the query 'set up Zapier integration,' the first relevant document appears at rank 3 (reciprocal rank ≈ 0.33). For 'change chatbot color,' it appears at rank 1 (1.0). For 'export conversation logs,' at rank 2 (0.5). MRR = (0.33 + 1.0 + 0.5) / 3 = 0.61. After fine-tuning the embedding model on support data, the first relevant document ranks first for most queries, pushing MRR to 0.92.
Common Mistakes
- ✕ Confusing MRR with MAP (Mean Average Precision): MRR only considers the first relevant document, while MAP considers all relevant documents (see the sketch after this list).
- ✕ Using MRR as the only metric when there are multiple relevant documents; it ignores whether all relevant documents are retrieved.
- ✕ Evaluating on too small a query set, where a few outlier queries disproportionately skew the average.
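To make the first distinction concrete, here is a short sketch of average precision for binary relevance, reusing the reciprocal_rank helper from How It Works (document IDs and values are illustrative):

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@k over each rank k holding a relevant doc."""
    hits, precisions = 0, []
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0


ranked = ["d2", "d5", "d1", "d3"]  # hypothetical ranked results
relevant = {"d1", "d3"}            # two relevant documents, at ranks 3 and 4

print(round(reciprocal_rank(ranked, relevant), 2))    # 0.33 -- only the first hit counts
print(round(average_precision(ranked, relevant), 2))  # 0.42 -- every hit counts
```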
Related Terms
Retrieval Recall
Retrieval recall measures the fraction of relevant documents that a retrieval system successfully returns from a corpus. In RAG systems, high recall ensures the LLM has access to all information needed to answer a query correctly.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.
RAG Evaluation
RAG evaluation is the systematic measurement of a RAG system's quality across multiple dimensions — including retrieval accuracy, answer faithfulness, relevance, and completeness — to identify weaknesses and guide improvement.
Reranking
Reranking is a second-stage retrieval step that takes an initial set of candidate documents returned by a fast retrieval method and reorders them using a more accurate but computationally expensive model to improve final result quality.
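One common implementation of this pattern, sketched here with the sentence-transformers CrossEncoder class and a public MS MARCO reranking checkpoint (the query and candidate snippets are invented for illustration):

```python
from sentence_transformers import CrossEncoder

# Small public cross-encoder trained for passage reranking on MS MARCO.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "set up Zapier integration"
candidates = [
    "Exporting conversation logs as CSV",
    "Step-by-step guide to connecting Zapier to your chatbot",
    "Changing your chatbot's color scheme",
]

# Score every (query, document) pair, then reorder by descending score.
scores = model.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # the Zapier guide should now sit at rank 1
```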
Normalized Discounted Cumulative Gain (NDCG)
NDCG is a retrieval ranking metric that rewards placing highly relevant documents near the top of results, with a logarithmic penalty for lower positions. It captures both relevance grades and ranking quality in a single normalized score.
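A minimal sketch of the computation for graded relevance, assuming per-position relevance grades are already assigned (the grades below are illustrative):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each grade is discounted by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the ideal (descending) ordering, giving a score in [0, 1]."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(round(ndcg([1, 3, 2]), 3))  # 0.817 -- the misplaced grade-3 doc costs ~0.18
```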
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →