RAG Triad
Definition
The RAG Triad, introduced by TruEra and implemented in the TruLens framework, provides three complementary evaluation metrics that together characterize RAG system quality. Context relevance measures whether retrieved chunks are actually relevant to the query—high context relevance means the retriever is returning focused, useful content. Groundedness (or faithfulness) measures whether every claim in the generated answer can be traced to a specific statement in the retrieved context—high groundedness means the LLM is not hallucinating. Answer relevance measures whether the answer addresses what the user asked—high answer relevance means the system understood the question and responded appropriately. A high-quality RAG system scores well on all three dimensions.
Why It Matters
Each dimension of the RAG Triad catches a different failure mode. Context relevance catches poor retrieval—when the system retrieves plausible but not actually useful documents. Groundedness catches hallucination—when the LLM generates content not supported by context. Answer relevance catches misunderstanding or off-topic responses. Teams deploying 99helpers chatbots use the RAG Triad to diagnose where quality problems originate: a low context relevance score points to retrieval problems (fix the embeddings or chunking), low groundedness points to generation problems (fix the prompt or model), and low answer relevance points to query understanding problems (fix preprocessing or try query rewriting).
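The diagnostic mapping above can be sketched as a simple triage function. This is a minimal illustration, not part of TruLens; the 0.7 threshold is an assumed cutoff you would tune against your own baselines.

```python
def diagnose_rag_triad(context_relevance, groundedness, answer_relevance, threshold=0.7):
    """Map low RAG Triad scores to the pipeline component most likely at fault.

    The 0.7 threshold is illustrative, not a TruLens default.
    """
    fixes = []
    if context_relevance < threshold:
        fixes.append("retrieval: revisit embeddings or chunking")
    if groundedness < threshold:
        fixes.append("generation: revisit the prompt or model")
    if answer_relevance < threshold:
        fixes.append("query understanding: add preprocessing or query rewriting")
    return fixes or ["all checks pass"]
```

For example, `diagnose_rag_triad(0.85, 0.91, 0.67)` points at query understanding, while scores all above the threshold return "all checks pass".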
How It Works
TruLens implements the RAG Triad using LLM-as-judge evaluators. For context relevance, the judge rates each retrieved chunk: 'How relevant is this chunk to the query?' For groundedness, the judge checks each sentence in the answer against the context: 'Is this claim directly supported by the provided context?' For answer relevance, the judge rates the answer: 'Does this answer address the user's question?' Each metric produces a score between 0 and 1. Aggregate scores across an evaluation dataset reveal which component of the pipeline is the weakest link. TruLens also supports real-time monitoring, logging the RAG Triad scores for every production query and alerting when scores drop below thresholds.
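The three judge calls can be sketched as follows. This is a self-contained illustration, not TruLens's actual API: the LLM judge is stubbed with a crude word-overlap scorer so the example runs offline, but the structure (one score per dimension, each in 0-1) mirrors the triad described above.

```python
def overlap_score(a: str, b: str) -> float:
    """Crude stand-in for an LLM judge: fraction of words in `a` also found in `b`."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    return len(words_a & words_b) / len(words_a) if words_a else 0.0

def rag_triad(query: str, chunks: list, answer: str) -> dict:
    """Score one RAG interaction on the three triad dimensions (each 0-1)."""
    # Context relevance: how relevant is the best retrieved chunk to the query?
    context_relevance = max(overlap_score(query, c) for c in chunks)
    # Groundedness: is each answer sentence supported by the retrieved context?
    context = " ".join(chunks)
    sentences = [s for s in answer.split(". ") if s]
    groundedness = sum(overlap_score(s, context) for s in sentences) / len(sentences)
    # Answer relevance: does the answer address the user's question?
    answer_relevance = overlap_score(query, answer)
    return {
        "context_relevance": context_relevance,
        "groundedness": groundedness,
        "answer_relevance": answer_relevance,
    }
```

In a real system each `overlap_score` call would be replaced by a prompt to a judge LLM asking one of the three questions above and parsing a numeric rating from its reply.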
RAG Triad — Three Checks for Quality Responses
- Context Relevance (Query → Context): is the retrieved context on-topic and relevant to the query? (e.g., 0.89)
- Groundedness (Context → Answer): is the answer supported by the retrieved context? (e.g., 0.84)
- Answer Relevance (Answer → Query): does the answer address the user's question? (e.g., 0.91)
All three must pass: failing any single check indicates a breakdown, whether poor retrieval, hallucination, or an off-topic answer.
Real-World Example
A 99helpers team notices their chatbot satisfaction scores dropped after a knowledge base expansion. Running the RAG Triad evaluation on 1,000 recent queries reveals: context relevance = 0.85 (acceptable), groundedness = 0.91 (good), answer relevance = 0.67 (poor). The low answer relevance score indicates the system is generating answers that don't address what users asked. Investigation reveals that the new documents added to the knowledge base contain a lot of tangentially related content, causing the retriever to surface contextually irrelevant but topically adjacent documents. The fix: improve chunking and add metadata filters to improve context relevance, which indirectly improves answer relevance.
Common Mistakes
- ✕ Optimizing only one metric in isolation: a high groundedness score with low context relevance still produces unhelpful answers, just grounded in the wrong documents.
- ✕ Running the evaluation on a non-representative sample: if the evaluation set doesn't include the query types where the system fails, the metrics will be misleadingly high.
- ✕ Ignoring the cost of running the RAG Triad in production on every query: LLM-based evaluation adds significant cost and latency, so use sampling or batch evaluation instead.
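One way to keep production cost bounded, per the last point above, is to evaluate only a random sample of queries and alert when any triad metric falls below a threshold. This is a sketch under assumptions: the 10% sample rate and 0.7 alert threshold are illustrative values, not TruLens defaults.

```python
import random

def should_evaluate(sample_rate=0.1, rng=None):
    """Decide whether to run the (expensive) RAG Triad judges on this query."""
    rng = rng or random
    return rng.random() < sample_rate

def check_thresholds(scores, threshold=0.7):
    """Return the names of triad metrics that fell below the alert threshold."""
    return [name for name, value in scores.items() if value < threshold]
```

In a monitoring loop, `should_evaluate()` gates the judge calls, and a non-empty result from `check_thresholds` triggers an alert for the offending metric.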
Related Terms
RAG Evaluation
RAG evaluation is the systematic measurement of a RAG system's quality across multiple dimensions — including retrieval accuracy, answer faithfulness, relevance, and completeness — to identify weaknesses and guide improvement.
Faithfulness
Faithfulness is a RAG evaluation metric that measures whether the information in a generated answer is fully supported by the retrieved context, quantifying how well the model avoids hallucination when given source documents.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.
LLM-as-Judge
LLM-as-judge is an evaluation technique where a language model assesses the quality of RAG outputs—scoring faithfulness, relevance, and completeness—enabling scalable automated evaluation without human labelers for every query.
Hallucination
Hallucination in AI refers to when a language model generates confident, plausible-sounding text that is factually incorrect, unsupported by the provided context, or entirely fabricated, posing a major reliability challenge for AI applications.