Retrieval-Augmented Fine-Tuning (RAFT)
Definition
RAFT, introduced by researchers at UC Berkeley, bridges the gap between RAG (which uses general LLMs not specifically trained for retrieval-grounded generation) and domain-specific fine-tuning (which trains on domain knowledge without retrieval context). Each RAFT training example consists of a question, one relevant 'oracle' document, 3-5 distractor documents that look plausible but don't contain the answer, and an answer with a chain-of-thought explanation citing the oracle document. Training on this mixture teaches the LLM two critical skills: (1) identifying which retrieved document contains the answer, and (2) generating an answer that explicitly references the supporting evidence.
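A minimal sketch of what one such training example might look like as a record, assuming a JSON-style schema (the field names, document IDs, and texts are illustrative, not a fixed RAFT standard):

```python
# One illustrative RAFT training example (schema and values are hypothetical).
raft_example = {
    "question": "How do I reset my API key?",
    "documents": [
        {"id": "D1", "text": "To reset your API key, open Settings > API ...", "oracle": True},
        {"id": "D2", "text": "Billing cycles renew on the first of each month ...", "oracle": False},
        {"id": "D3", "text": "API rate limits are 100 requests per minute ...", "oracle": False},
        {"id": "D4", "text": "Webhooks can be configured under Integrations ...", "oracle": False},
    ],
    # Chain-of-thought target: cite the oracle document, reason, then answer.
    "answer": (
        "Based on document [D1]: the Settings > API page exposes a reset "
        "option, so the key can be regenerated there. -> Navigate to "
        "Settings > API and reset the key."
    ),
}

# Exactly one oracle document per example, per the definition above.
assert sum(d["oracle"] for d in raft_example["documents"]) == 1
```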
Why It Matters
Standard fine-tuning on domain data teaches an LLM what to know but not how to use retrieved context effectively. A general instruction-tuned LLM used in RAG may ignore relevant context in favor of its parametric knowledge, or hallucinate by blending retrieved and memorized information. RAFT trains the model specifically for RAG-style inference, significantly improving its ability to identify relevant documents in a noisy retrieved set and generate faithful, grounded answers. For 99helpers customers in specialized domains (legal, medical, financial software), RAFT-fine-tuned models can dramatically outperform generic LLMs on domain-specific RAG benchmarks.
How It Works
RAFT training data construction: for each document in the knowledge base, use an LLM to generate 3-5 questions answerable from that document. For each question, include the relevant (oracle) document plus 3-5 randomly sampled irrelevant documents as distractors. Generate a chain-of-thought answer of the form 'Based on document [X]: [reasoning] → [answer]'. Fine-tune the LLM (e.g., Llama 3 or Mistral) on this dataset using standard supervised fine-tuning. The resulting model is specifically optimized for reasoning over and synthesizing from mixed-relevance retrieved context. RAFT-fine-tuned models outperform both RAG with a generic model and fine-tuned models without retrieval on domain-specific QA benchmarks.
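The construction loop described above can be sketched as follows. `generate_questions` and `generate_cot_answer` stand in for LLM calls and are assumptions, stubbed here so the pipeline shape is runnable, not part of any specific library:

```python
import random

def build_raft_dataset(corpus, num_questions=3, num_distractors=4, seed=0):
    """Build RAFT training examples from a list of {'id', 'text'} documents."""
    rng = random.Random(seed)
    dataset = []
    for doc in corpus:
        for question in generate_questions(doc, num_questions):
            # Sample distractors from the rest of the corpus.
            others = [d for d in corpus if d["id"] != doc["id"]]
            distractors = rng.sample(others, min(num_distractors, len(others)))
            # Shuffle so the oracle's position is not a learnable shortcut.
            context = [doc] + distractors
            rng.shuffle(context)
            dataset.append({
                "question": question,
                "documents": context,
                "answer": generate_cot_answer(question, oracle=doc),
            })
    return dataset

# Hypothetical stubs standing in for LLM-backed generation.
def generate_questions(doc, n):
    return [f"Question {i} about {doc['id']}?" for i in range(n)]

def generate_cot_answer(question, oracle):
    return f"Based on document [{oracle['id']}]: [reasoning] -> [answer]"
```

The resulting list of records can be fed to any standard supervised fine-tuning loop (e.g., formatting each record into a prompt/completion pair for an SFT trainer).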
RAFT — Retrieval Augmented Fine-Tuning
Training Data Format
- Question: "How do I reset my API key?"
- Context Documents: Doc A (oracle), Doc B (distractor), Doc C (distractor)
- Answer: "Navigate to Settings > API..."
RAFT Teaches the Model to
- Identify relevant passages: focus on context that answers the question
- Ignore distractor documents: skip documents that look related but are misleading
- Generate grounded answers: cite the specific context used
- Standard RAG: no fine-tuning; a generic instruction-tuned model reads the retrieved context
- RAFT: fine-tuned on question-documents-answer triples that include distractors
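At inference time, a RAFT-tuned model is prompted in the same shape it was trained on: retrieved documents first, then the question. A sketch of assembling that prompt (the template wording is illustrative and should match whatever format the model was actually fine-tuned on):

```python
def format_raft_prompt(question, documents):
    """Assemble retrieved documents plus the question into a RAFT-style prompt.

    `documents` is a list of {'id', 'text'} records from the retriever; the
    instruction text below is an assumption, not a fixed RAFT standard.
    """
    doc_block = "\n\n".join(f"[{d['id']}] {d['text']}" for d in documents)
    return (
        "Answer the question using only the documents that support it, "
        "citing the document you used.\n\n"
        f"Documents:\n{doc_block}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Because training examples shuffled oracle and distractor positions, the model should handle the oracle appearing anywhere in the retrieved set.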
Real-World Example
A 99helpers enterprise customer in the healthcare sector deploys a support chatbot answering questions about their EHR software. A general GPT-4 RAG system achieves 72% answer accuracy on their evaluation set. Using RAFT, the team constructs 50,000 training examples from their EHR documentation and fine-tunes Llama 3 8B. The RAFT model reaches 84% accuracy, surpassing the GPT-4 RAG baseline at roughly one-tenth the inference cost per query, which makes deployment at scale sustainable while meeting their latency requirements.
Common Mistakes
- ✕ Constructing RAFT training data with only relevant documents (no distractors): the distractor documents are essential for teaching the model to ignore irrelevant context.
- ✕ Fine-tuning on synthetic data without validating that generated questions are diverse and representative of real user queries.
- ✕ Using RAFT-fine-tuned models outside the domain they were trained on: domain-specific fine-tuning reduces generalization to other topics.
Related Terms
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language model responses by first retrieving relevant documents from an external knowledge base and then using that retrieved content as context when generating an answer.
Faithfulness
Faithfulness is a RAG evaluation metric that measures whether the information in a generated answer is fully supported by the retrieved context, quantifying how well the model avoids hallucination when given source documents.
Hallucination
Hallucination in AI refers to when a language model generates confident, plausible-sounding text that is factually incorrect, unsupported by the provided context, or entirely fabricated, posing a major reliability challenge for AI applications.
RAG Evaluation
RAG evaluation is the systematic measurement of a RAG system's quality across multiple dimensions — including retrieval accuracy, answer faithfulness, relevance, and completeness — to identify weaknesses and guide improvement.
Embedding Model
An embedding model is a machine learning model that converts text (or other data) into dense numerical vectors that capture semantic meaning, enabling similarity search and serving as the foundation of RAG retrieval systems.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →