Normalized Discounted Cumulative Gain (NDCG)
Definition
Normalized Discounted Cumulative Gain extends simpler metrics like precision and MRR by accounting for graded relevance—documents can be labeled highly relevant, somewhat relevant, or not relevant, not just binary. DCG sums the relevance grades of retrieved documents, discounting each by the logarithm of its rank position to penalize relevant documents buried lower in the list. NDCG normalizes DCG by the ideal DCG (the best possible ranking), yielding a score between 0 and 1 regardless of query difficulty. NDCG@K evaluates quality in the top-K positions, making it ideal for RAG systems where K documents are passed to the LLM.
Why It Matters
NDCG is the gold standard for evaluating search and retrieval systems when relevance is not binary. In RAG pipelines, not all relevant documents are equally useful—the most authoritative, detailed document should rank first, while partially relevant documents should rank below it. Teams running large-scale 99helpers deployments use NDCG@10 when comparing retrieval strategies because it penalizes systems that retrieve relevant documents but rank them poorly. A high NDCG score indicates that when users (or LLMs) read the top results in order, they encounter the most relevant content first.
How It Works
Computing NDCG requires graded relevance labels (e.g., 0 = not relevant, 1 = somewhat relevant, 2 = highly relevant). For each query, compute DCG = sum(relevance_grade[i] / log2(i + 1)) over positions i = 1..K. Compute IDCG with the same formula after sorting the documents by relevance grade in descending order (the ideal ranking). Then NDCG = DCG / IDCG. Automated grading can use LLM judges with rubrics, reducing the human labeling burden. NDCG is also commonly used as the objective in reranker fine-tuning, where the reranker is trained to maximize NDCG on labeled training queries.
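The computation above can be sketched in a few lines of Python (function names are illustrative, and IDCG is computed from the retrieved documents' grades only):

```python
import math

def dcg_at_k(grades, k):
    """DCG@K: each relevance grade discounted by log2 of its rank + 1."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades[:k], start=1))

def ndcg_at_k(grades, k):
    """NDCG@K: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0
```

For example, `ndcg_at_k([1, 2, 0], 3)` scores a ranking whose top three documents have grades 1, 2, and 0, returning roughly 0.86; a perfectly ordered list such as `[2, 1, 0]` returns exactly 1.0.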
NDCG — Ranking Quality Metric

Query: "How do I cancel my subscription?"

Ranked results:
1. How to cancel your plan
2. Billing & subscription FAQ
3. Account deletion guide
4. Pricing overview page

DCG: 6.32 · Ideal DCG: 7.14 · NDCG Score: 0.87
Real-World Example
A 99helpers evaluation dataset grades documents on a 0–2 scale. For the query "configure webhook notifications," the retriever returns: somewhat relevant API docs (rank 1), highly relevant guide (rank 2), irrelevant billing page (rank 3). DCG = 1/log2(2) + 2/log2(3) + 0 = 1 + 1.26 + 0 = 2.26. IDCG, with the guide first and the API docs second, = 2/log2(2) + 1/log2(3) = 2 + 0.63 = 2.63. NDCG = 2.26/2.63 = 0.86. After reranking, the highly relevant guide moves to rank 1 and the API docs drop to rank 3, giving DCG = 2 + 0 + 1/log2(4) = 2.5 and NDCG = 2.5/2.63 = 0.95.
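The arithmetic in this example can be checked with a short script (a sketch; the grade lists mirror the before/after rankings described above):

```python
import math

def ndcg(grades):
    """NDCG over a full ranked list of relevance grades."""
    dcg = sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))
    idcg = sum(g / math.log2(i + 1)
               for i, g in enumerate(sorted(grades, reverse=True), start=1))
    return dcg / idcg

before = [1, 2, 0]  # API docs (1), guide (2), billing page (0)
after = [2, 0, 1]   # guide reranked to the top; API docs drop to rank 3
print(round(ndcg(before), 2), round(ndcg(after), 2))  # 0.86 0.95
```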
Common Mistakes
- ✕ Using NDCG with binary relevance labels defeats its main purpose—with only relevant/not-relevant grades, it loses its advantage over simpler rank-aware metrics.
- ✕ Computing NDCG@K with a very small K (e.g., K = 1) collapses the metric to the relevance of the single top document, losing the graded ranking signal across positions.
- ✕ Ignoring the cost of human labeling for graded relevance—synthetic LLM-judged labels can introduce bias if not calibrated against human judgments.
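The first mistake can be made concrete: with graded labels NDCG distinguishes two orderings of the same documents, but once the grades are collapsed to binary, both orderings score identically (a sketch with hypothetical two-document grade lists):

```python
import math

def ndcg(grades):
    """NDCG over a full ranked list of relevance grades."""
    dcg = sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))
    idcg = sum(g / math.log2(i + 1)
               for i, g in enumerate(sorted(grades, reverse=True), start=1))
    return dcg / idcg

# Graded labels: ranking the grade-2 document first is visibly better.
print(round(ndcg([2, 1]), 2), round(ndcg([1, 2]), 2))  # 1.0 0.86
# Binary labels: both documents become grade 1, so both orderings score 1.0.
print(ndcg([1, 1]))  # 1.0
```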
Related Terms
Mean Reciprocal Rank (MRR)
Mean Reciprocal Rank (MRR) is a retrieval evaluation metric that measures how highly the first relevant document is ranked, averaged across queries. It rewards systems that place the most relevant result near the top of the list.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.
RAG Evaluation
RAG evaluation is the systematic measurement of a RAG system's quality across multiple dimensions — including retrieval accuracy, answer faithfulness, relevance, and completeness — to identify weaknesses and guide improvement.
Reranking
Reranking is a second-stage retrieval step that takes an initial set of candidate documents returned by a fast retrieval method and reorders them using a more accurate but computationally expensive model to improve final result quality.
Retrieval Recall
Retrieval recall measures the fraction of relevant documents that a retrieval system successfully returns from a corpus. In RAG systems, high recall ensures the LLM has access to all information needed to answer a query correctly.