RAG Triad
Definition
The RAG Triad, introduced by TruEra and implemented in the TruLens framework, provides three complementary evaluation metrics that together characterize RAG system quality. Context relevance measures whether retrieved chunks are actually relevant to the query—high context relevance means the retriever is returning focused, useful content. Groundedness (or faithfulness) measures whether every claim in the generated answer can be traced to a specific statement in the retrieved context—high groundedness means the LLM is not hallucinating. Answer relevance measures whether the answer addresses what the user asked—high answer relevance means the system understood the question and responded appropriately. A high-quality RAG system scores well on all three dimensions.
Why It Matters
Each dimension of the RAG Triad catches a different failure mode. Context relevance catches poor retrieval—when the system retrieves plausible but not actually useful documents. Groundedness catches hallucination—when the LLM generates content not supported by context. Answer relevance catches misunderstanding or off-topic responses. Teams deploying 99helpers chatbots use the RAG Triad to diagnose where quality problems originate: a low context relevance score points to retrieval problems (fix the embeddings or chunking), low groundedness points to generation problems (fix the prompt or model), and low answer relevance points to query understanding problems (fix preprocessing or try query rewriting).
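The diagnostic mapping above can be sketched as a simple triage function. This is a minimal illustration, not part of TruLens; the 0.7 threshold is an assumed cutoff you would tune against your own baselines.

```python
def diagnose_rag_triad(context_relevance, groundedness, answer_relevance, threshold=0.7):
    """Map low RAG Triad scores to the pipeline component most likely at fault.

    The 0.7 threshold is illustrative, not a TruLens default.
    """
    fixes = []
    if context_relevance < threshold:
        fixes.append("retrieval: revisit embeddings or chunking")
    if groundedness < threshold:
        fixes.append("generation: revisit the prompt or model")
    if answer_relevance < threshold:
        fixes.append("query understanding: add preprocessing or query rewriting")
    return fixes or ["all checks pass"]
```

For example, `diagnose_rag_triad(0.85, 0.91, 0.67)` points at query understanding, while scores all above the threshold return "all checks pass".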
How It Works
TruLens implements the RAG Triad using LLM-as-judge evaluators. For context relevance, the judge rates each retrieved chunk: 'How relevant is this chunk to the query?' For groundedness, the judge checks each sentence in the answer against the context: 'Is this claim directly supported by the provided context?' For answer relevance, the judge rates the answer: 'Does this answer address the user's question?' Each metric produces a score between 0 and 1. Aggregate scores across an evaluation dataset reveal which component of the pipeline is the weakest link. TruLens also supports real-time monitoring, logging the RAG Triad scores for every production query and alerting when scores drop below thresholds.
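The three judge calls can be sketched as follows. This is a self-contained illustration, not TruLens's actual API: the LLM judge is stubbed with a crude word-overlap scorer so the example runs offline, but the structure (one score per dimension, each in 0-1) mirrors the triad described above.

```python
def overlap_score(a: str, b: str) -> float:
    """Crude stand-in for an LLM judge: fraction of words in `a` also found in `b`."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    return len(words_a & words_b) / len(words_a) if words_a else 0.0

def rag_triad(query: str, chunks: list, answer: str) -> dict:
    """Score one RAG interaction on the three triad dimensions (each 0-1)."""
    # Context relevance: how relevant is the best retrieved chunk to the query?
    context_relevance = max(overlap_score(query, c) for c in chunks)
    # Groundedness: is each answer sentence supported by the retrieved context?
    context = " ".join(chunks)
    sentences = [s for s in answer.split(". ") if s]
    groundedness = sum(overlap_score(s, context) for s in sentences) / len(sentences)
    # Answer relevance: does the answer address the user's question?
    answer_relevance = overlap_score(query, answer)
    return {
        "context_relevance": context_relevance,
        "groundedness": groundedness,
        "answer_relevance": answer_relevance,
    }
```

In a real system each `overlap_score` call would be replaced by a prompt to a judge LLM asking one of the three questions above and parsing a numeric rating from its reply.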
RAG Triad — Three Checks for Quality Responses
- Context Relevance (Query → Context): is the retrieved context on-topic and relevant to the query? (e.g., 0.89)
- Groundedness (Context → Answer): is the answer supported by the retrieved context? (e.g., 0.84)
- Answer Relevance (Answer → Query): does the answer address the user's question? (e.g., 0.91)
All three must pass: failing any single check indicates a breakdown, whether poor retrieval, hallucination, or an off-topic answer.
Real-World Example
A 99helpers team notices their chatbot satisfaction scores dropped after a knowledge base expansion. Running the RAG Triad evaluation on 1,000 recent queries reveals: context relevance = 0.85 (acceptable), groundedness = 0.91 (good), answer relevance = 0.67 (poor). The low answer relevance score indicates the system is generating answers that don't address what users asked. Investigation reveals that the new documents added to the knowledge base contain a lot of tangentially related content, causing the retriever to surface contextually irrelevant but topically adjacent documents. The fix: improve chunking and add metadata filters to improve context relevance, which indirectly improves answer relevance.
Common Mistakes
- ✕ Optimizing only one metric in isolation: a high groundedness score with low context relevance still produces unhelpful answers, just grounded in the wrong documents.
- ✕ Running the evaluation on a non-representative sample: if the evaluation set doesn't include the query types where the system fails, the metrics will be misleadingly high.
- ✕ Ignoring the cost of running the RAG Triad in production on every query: LLM-based evaluation adds significant cost and latency, so use sampling or batch evaluation instead.
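One way to keep production cost bounded, per the last point above, is to evaluate only a random sample of queries and alert when any triad metric falls below a threshold. This is a sketch under assumptions: the 10% sample rate and 0.7 alert threshold are illustrative values, not TruLens defaults.

```python
import random

def should_evaluate(sample_rate=0.1, rng=None):
    """Decide whether to run the (expensive) RAG Triad judges on this query."""
    rng = rng or random
    return rng.random() < sample_rate

def check_thresholds(scores, threshold=0.7):
    """Return the names of triad metrics that fell below the alert threshold."""
    return [name for name, value in scores.items() if value < threshold]
```

In a monitoring loop, `should_evaluate()` gates the judge calls, and a non-empty result from `check_thresholds` triggers an alert for the offending metric.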
Related Terms
RAG Evaluation
RAG evaluation is the systematic measurement of a RAG system's quality across multiple dimensions — including retrieval accuracy, answer faithfulness, relevance, and completeness — to identify weaknesses and guide improvement.
Faithfulness
Faithfulness is a RAG evaluation metric that measures whether the information in a generated answer is fully supported by the retrieved context, quantifying how well the model avoids hallucination when given source documents.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.
LLM-as-Judge
LLM-as-judge is an evaluation technique where a language model assesses the quality of RAG outputs—scoring faithfulness, relevance, and completeness—enabling scalable automated evaluation without human labelers for every query.
Hallucination
Hallucination in AI refers to when a language model generates confident, plausible-sounding text that is factually incorrect, unsupported by the provided context, or entirely fabricated, posing a major reliability challenge for AI applications.