Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is the dominant architecture for building AI systems that answer questions based on specific knowledge sources rather than relying solely on pre-trained model weights. This category covers every component of a RAG pipeline — vector databases, embeddings, chunking strategies, reranking, and context injection — as well as advanced patterns like hybrid search and agentic RAG. Mastering these terms is essential for anyone building production-ready AI assistants.
71 terms in this category
Adaptive RAG
Adaptive RAG dynamically selects the retrieval strategy—no retrieval, single-step retrieval, or multi-step iterative retrieval—based on the complexity of each query, optimizing cost and latency without sacrificing answer quality.
Agentic RAG
Agentic RAG extends basic RAG with autonomous planning and multi-step reasoning, where the AI agent decides which sources to query, in what order, and whether additional retrieval steps are needed before generating a final answer.
Approximate Nearest Neighbor
Approximate Nearest Neighbor (ANN) search finds vectors that are close to a query vector with high probability but without guaranteeing exactness, enabling fast similarity search across millions of vectors in exchange for a small loss in accuracy.
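As a sketch of what ANN indexing looks like in practice, here is an HNSW index built with the hnswlib library; the dimensionality, index parameters, and random vectors are all illustrative:

```python
# Build an HNSW index over random vectors and run an approximate k-NN query.
import hnswlib
import numpy as np

dim = 384
vectors = np.random.rand(10_000, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=10_000, ef_construction=200, M=16)
index.add_items(vectors, np.arange(10_000))

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)  # approximate top-10 neighbors
```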
Bi-Encoder
A bi-encoder is a neural network architecture that independently encodes queries and documents into separate embedding vectors, enabling fast offline document indexing and real-time similarity search for scalable retrieval.
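A minimal bi-encoder sketch using the sentence-transformers library (the model name is one commonly used general-purpose encoder; the documents are illustrative). Document vectors are computed once offline, and only the query is embedded at request time:

```python
# Bi-encoder retrieval: documents and queries are encoded independently,
# so document vectors can be precomputed and indexed ahead of time.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["Refunds take 5 business days.", "Invoices are sent monthly."]
doc_vecs = model.encode(docs, normalize_embeddings=True)      # offline, at index time
query_vec = model.encode("how long do refunds take?", normalize_embeddings=True)

scores = doc_vecs @ query_vec   # cosine similarity, since vectors are normalized
```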
BM25
BM25 (Best Match 25) is the industry-standard sparse retrieval algorithm that scores documents against a query based on term frequency, inverse document frequency, and document length normalization, widely used in search engines and hybrid RAG systems.
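For reference, the standard BM25 scoring function, where f(q_i, D) is the frequency of term q_i in document D, |D| is the document's length, avgdl is the average document length in the collection, and k_1 and b are tuning parameters (typically k_1 ≈ 1.2 and b ≈ 0.75):

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```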
Chroma
Chroma is a lightweight, open-source vector database designed for rapid prototyping and development of AI applications, offering a simple Python API and in-memory or persistent storage modes.
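A minimal sketch of Chroma's Python API (document contents and IDs are illustrative); the default client is in-memory, and Chroma embeds the documents with its built-in embedding function:

```python
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
collection = client.create_collection(name="docs")

collection.add(
    ids=["doc1", "doc2"],
    documents=["Refunds are processed within 5 business days.",
               "Invoices are emailed at the start of each month."],
    metadatas=[{"source": "billing-faq"}, {"source": "billing-faq"}],
)

results = collection.query(query_texts=["how long do refunds take?"], n_results=1)
```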
Chunk Overlap
Chunk overlap is a chunking strategy where consecutive document chunks share a portion of overlapping text, ensuring that information spanning chunk boundaries is captured in at least one complete chunk.
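A minimal character-based sketch (sizes are illustrative; production systems usually count tokens instead). Each chunk starts chunk_size minus overlap characters after the previous one, so boundary-spanning text appears intact in at least one chunk; the same mechanism underlies the sliding window chunking entry further down:

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Each window advances by chunk_size - overlap, so consecutive chunks
    # share `overlap` characters and boundary content is never split away.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```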
Chunk Size
Chunk size is the maximum number of tokens or characters in each document segment created during the chunking phase of RAG indexing, controlling the granularity of retrieval and the amount of context available per retrieved chunk.
Context Window
A context window is the maximum amount of text (measured in tokens) that a language model can process in a single inference call, determining how much retrieved content, conversation history, and instructions can be included in a RAG prompt.
Contextual Compression
Contextual compression is a RAG technique that extracts or summarizes only the portions of retrieved documents that are relevant to the user's query, reducing the amount of irrelevant text passed to the LLM and improving context quality.
Corrective RAG (CRAG)
Corrective RAG (CRAG) adds a self-evaluation step that assesses retrieved document relevance and automatically triggers web search or knowledge base expansion when initial retrieval is deemed insufficient.
Cosine Similarity
Cosine similarity is a mathematical metric that measures the similarity between two vectors by calculating the cosine of the angle between them, producing a score from -1 to 1 where 1 indicates identical direction. It is widely used in RAG and semantic search.
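For vectors A and B:

```latex
\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert}
             = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2}\,\sqrt{\sum_{i} B_i^2}}
```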
Cross-Encoder
A cross-encoder is a transformer model that processes a query and a document together in a single forward pass, producing a relevance score that captures fine-grained query-document interactions for high-quality reranking.
Data Connector
A data connector in RAG systems is an integration component that ingests content from a specific external source—such as Confluence, Notion, Google Drive, or Zendesk—and transforms it into a format suitable for embedding and storage in a vector database.
Dense Retrieval
Dense retrieval is a retrieval approach that encodes both queries and documents into dense embedding vectors and finds relevant documents by computing vector similarity, enabling semantic matching beyond exact keyword overlap.
Document Chunking
Document chunking is the process of splitting large documents into smaller text segments before embedding and indexing for RAG, balancing chunk size to preserve context while staying within embedding model limits and enabling precise retrieval.
Document Loader
A document loader is a component that reads raw files from a file system, URL, or API and converts them into a standardized Document object with text content and metadata, serving as the first step in a RAG ingestion pipeline.
Embedding Cache
An embedding cache stores previously computed vector embeddings so identical or similar text does not need to be re-embedded, reducing API costs, latency, and load on embedding model infrastructure.
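A minimal sketch of an exact-match cache keyed by a hash of the text; embed_fn stands in for whatever embedding call your stack uses and is an assumption, not a real API:

```python
import hashlib

_cache: dict[str, list[float]] = {}

def cached_embed(text: str, embed_fn) -> list[float]:
    # Identical text hashes to the same key, so the model is called only on a miss.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed_fn(text)
    return _cache[key]
```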
Embedding Model
An embedding model is a machine learning model that converts text (or other data) into dense numerical vectors that capture semantic meaning, enabling similarity search and serving as the foundation of RAG retrieval systems.
Faithfulness
Faithfulness is a RAG evaluation metric that measures whether the information in a generated answer is fully supported by the retrieved context, quantifying how well the model avoids hallucination when given source documents.
Generation Pipeline
A generation pipeline is the LLM-side workflow in RAG that assembles retrieved context into a prompt, calls the language model, and post-processes the output into a final user-facing answer.
GraphRAG
GraphRAG combines retrieval-augmented generation with knowledge graph structures, enabling multi-hop reasoning across connected entities and relationships rather than retrieving isolated text chunks.
Grounding
Grounding in AI refers to anchoring a language model's responses to specific, verifiable source documents or data, reducing hallucination by ensuring the model draws on retrieved evidence rather than relying on potentially incorrect parametric knowledge.
Hallucination
Hallucination in AI occurs when a language model generates confident, plausible-sounding text that is factually incorrect, unsupported by the provided context, or entirely fabricated, posing a major reliability challenge for AI applications.
Hybrid Retrieval
Hybrid retrieval combines dense (semantic) and sparse (keyword) search methods to leverage the strengths of both, using a fusion step to merge their results into a single ranked list for better overall retrieval quality.
Hypothetical Document Embedding
Hypothetical Document Embedding (HyDE) is a RAG technique that improves retrieval by having an LLM generate a hypothetical document that would answer the user's query, then using that document's embedding rather than the query embedding for similarity search.
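A sketch of the HyDE flow; generate, embed, and search stand in for an LLM call, an embedding call, and a vector search (all three are assumptions, not real APIs):

```python
def hyde_search(query: str, generate, embed, search, k: int = 5) -> list[str]:
    # Embed a hypothetical answer instead of the raw query: the fake answer
    # usually lands closer to real answer passages in embedding space.
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical), k=k)
```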
Indexing Pipeline
An indexing pipeline is the offline data processing workflow that transforms raw documents into searchable vector embeddings, running during knowledge base setup and when content is updated.
Inverted Index
An inverted index is a data structure that maps each unique term in a document collection to the list of documents containing that term, enabling fast full-text keyword search and powering BM25 and other sparse retrieval algorithms.
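A toy inverted index over a small corpus (the documents are illustrative):

```python
from collections import defaultdict

corpus = ["refund policy for annual plans",
          "refund requests take five days",
          "invoice schedule for annual plans"]

index: defaultdict[str, set[int]] = defaultdict(set)
for doc_id, text in enumerate(corpus):
    for term in text.lower().split():
        index[term].add(doc_id)        # term -> IDs of documents containing it

hits = index["refund"] & index["annual"]   # documents containing both terms -> {0}
```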
Knowledge Graph RAG
Knowledge Graph RAG enhances retrieval by indexing document knowledge as a structured graph of entities and relationships, enabling precise lookup of specific facts and multi-hop traversal across connected information.
Late Chunking
Late chunking embeds an entire document through a long-context encoder before splitting it into retrievable chunks, allowing each chunk's embedding to capture full document context rather than just local sentence context.
LLM-as-Judge
LLM-as-judge is an evaluation technique where a language model assesses the quality of RAG outputs—scoring faithfulness, relevance, and completeness—enabling scalable automated evaluation without human labelers for every query.
Long-Context RAG
Long-context RAG leverages LLMs with large context windows (100K+ tokens) to process entire documents, or many documents at once, reducing reliance on retrieval precision but increasing cost and latency compared to traditional top-K retrieval.
Mean Reciprocal Rank (MRR)
Mean Reciprocal Rank (MRR) is a retrieval evaluation metric that measures how highly the first relevant document is ranked, averaged across queries. It rewards systems that place the most relevant result near the top of the list.
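Formally, over a query set Q, where rank_i is the position of the first relevant document for query i:

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```

For example, if the first relevant result appears at positions 1, 2, and 4 across three queries, MRR = (1 + 1/2 + 1/4) / 3 ≈ 0.58.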
Metadata Filtering
Metadata filtering restricts vector search to a subset of documents based on structured attributes — such as category, date, language, or source — enabling more precise retrieval by pre-filtering the candidate pool before similarity search.
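A sketch using Chroma's where clause, reusing the collection from the Chroma entry above (field names and values are illustrative):

```python
# Only vectors whose metadata matches the filter are considered in the search.
results = collection.query(
    query_texts=["how do I request a refund?"],
    n_results=3,
    where={"source": "billing-faq"},
)
```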
Multi-Query Retrieval
Multi-query retrieval generates multiple alternative phrasings of the user's question and retrieves documents for each phrasing separately, then merges results to achieve higher recall than any single query formulation would provide.
Multimodal RAG
Multimodal RAG extends retrieval-augmented generation to handle images, diagrams, tables, and other non-text content alongside text, enabling AI systems to retrieve and reason over mixed-media knowledge bases.
Vector Database Namespace
A namespace in vector databases is a logical partition that isolates groups of vectors within the same index, enabling multi-tenant RAG applications where different users or organizations have separate, private knowledge bases.
Normalized Discounted Cumulative Gain (NDCG)
NDCG is a retrieval ranking metric that rewards placing highly relevant documents near the top of results, with a logarithmic penalty for lower positions. It captures both relevance grades and ranking quality in a single normalized score.
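One common formulation, where rel_i is the graded relevance of the result at position i and IDCG@k is the DCG of the ideal ordering:

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```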
Parent-Child Chunking
Parent-child chunking indexes small child chunks for precise retrieval but returns their larger parent chunk as context, combining fine-grained retrieval accuracy with broad contextual information for the generation step.
Parent Document Retrieval
Parent document retrieval is a RAG strategy that indexes small chunks for precise retrieval but returns the larger parent document (or section) to the LLM as context, balancing retrieval precision with sufficient context for answer generation.
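A minimal sketch of the indexing side, assuming sections have already been extracted; each small child chunk is mapped back to its parent section, so a hit on a child returns the parent as context (the child size is illustrative):

```python
def index_children(sections: list[str], child_size: int = 200):
    children, child_to_parent = {}, {}
    for p, section in enumerate(sections):
        for c, start in enumerate(range(0, len(section), child_size)):
            cid = f"{p}-{c}"
            children[cid] = section[start:start + child_size]  # embedded and searched
            child_to_parent[cid] = section                     # returned as LLM context
    return children, child_to_parent
```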
pgvector
pgvector is a PostgreSQL extension that adds vector similarity search capabilities to Postgres, enabling teams to run RAG retrieval directly in their existing database without a separate vector store.
Pinecone
Pinecone is a fully managed vector database service designed for production machine learning applications, providing high-performance similarity search with simple APIs and automatic scaling for RAG and semantic search systems.
Qdrant
Qdrant is an open-source vector database and search engine built in Rust, offering high performance, rich filtering, sparse vector support for hybrid search, and flexible deployment from local to cloud for production RAG systems.
Query Decomposition
Query decomposition breaks a complex, multi-part user question into simpler sub-queries that can each be answered independently, improving RAG retrieval by matching each sub-query against relevant document segments.
Query Expansion
Query expansion is a retrieval technique that augments the original user query with related terms, synonyms, or alternative phrasings before search, improving recall by retrieving relevant documents that would not match the original query vocabulary.
Query Rewriting
Query rewriting is a technique that transforms a user's original query into an improved version — clearer, more complete, or better suited for retrieval — using an LLM to improve recall and relevance before searching the knowledge base.
RAG Evaluation
RAG evaluation is the systematic measurement of a RAG system's quality across multiple dimensions — including retrieval accuracy, answer faithfulness, relevance, and completeness — to identify weaknesses and guide improvement.
RAG Fusion
RAG Fusion is a retrieval technique that generates multiple query variations, retrieves documents for each, and uses Reciprocal Rank Fusion (RRF) to merge the ranked result lists, improving overall retrieval coverage and quality.
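A minimal Reciprocal Rank Fusion sketch; each input list is a ranking of document IDs from one query variation, and k = 60 is the constant commonly used in the RRF literature:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document gains more score the higher it ranks in each list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```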
RAG Pipeline
A RAG pipeline is the end-to-end sequence of components—ingestion, chunking, embedding, storage, retrieval, and generation—that transforms raw documents into AI-generated answers grounded in a knowledge base.
RAG Triad
The RAG Triad is an evaluation framework that assesses three core quality dimensions of RAG systems: context relevance (are retrieved documents relevant?), groundedness (is the answer based on the context?), and answer relevance (does the answer address the question?).
Recursive Chunking
Recursive chunking splits documents hierarchically using a priority list of separators—first by double newlines, then single newlines, then sentences, then words—ensuring chunks respect natural structural boundaries before falling back to finer splits.
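A from-scratch sketch of the strategy (the same idea behind splitters like LangChain's RecursiveCharacterTextSplitter); the size limit and separator list are illustrative:

```python
def recursive_split(text: str, max_len: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    if len(text) <= max_len:
        return [text]
    for sep in separators:                       # try the coarsest separator first
        if sep in text:
            chunks, current = [], ""
            for piece in text.split(sep):
                if len(piece) > max_len:         # still too big: recurse with finer separators
                    if current:
                        chunks.append(current)
                        current = ""
                    chunks.extend(recursive_split(piece, max_len, separators))
                elif not current:
                    current = piece
                elif len(current) + len(sep) + len(piece) <= max_len:
                    current += sep + piece       # greedily pack pieces into one chunk
                else:
                    chunks.append(current)
                    current = piece
            if current:
                chunks.append(current)
            return chunks
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]  # last resort: hard cut
```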
Reranking
Reranking is a second-stage retrieval step that takes an initial set of candidate documents returned by a fast retrieval method and reorders them using a more accurate but computationally expensive model to improve final result quality.
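A reranking sketch using a cross-encoder from the sentence-transformers library (the model name is one commonly used for this purpose); candidates would come from a fast first-stage retriever such as BM25 or dense search:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score each (query, document) pair jointly, then reorder by score.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```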
Retrieval-Augmented Fine-Tuning (RAFT)
RAFT (Retrieval-Augmented Fine-Tuning) trains LLMs on examples that mix relevant and irrelevant retrieved documents, teaching the model to identify and use relevant context while ignoring distractors—improving RAG performance in specific domains.
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language model responses by first retrieving relevant documents from an external knowledge base and then using that retrieved content as context when generating an answer.
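Stripped to its essentials, the architecture is a short loop. In this sketch, embed and generate stand in for an embedding model call and an LLM call (both are assumptions, not real APIs); the retrieve-then-generate sequence is the part RAG actually specifies:

```python
import numpy as np

def rag_answer(query: str, docs: list[str], embed, generate, top_k: int = 3) -> str:
    doc_vecs = np.array([embed(d) for d in docs])   # normally precomputed at index time
    q = np.array(embed(query))
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(docs[i] for i in np.argsort(sims)[::-1][:top_k])  # retrieve
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)                                                  # generate
```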
Retrieval Pipeline
A retrieval pipeline is the online query-time workflow that transforms a user question into a ranked set of relevant document chunks, serving as the information retrieval stage of a RAG system.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.
Retrieval Recall
Retrieval recall measures the fraction of relevant documents that a retrieval system successfully returns from a corpus. In RAG systems, high recall ensures the LLM has access to all information needed to answer a query correctly.
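Both recall and the precision metric above reduce to simple set ratios, with R the retrieved set and G the ground-truth relevant set:

```latex
\mathrm{precision} = \frac{|R \cap G|}{|R|},
\qquad
\mathrm{recall} = \frac{|R \cap G|}{|G|}
```

For example, retrieving 5 chunks of which 3 are relevant, out of 4 relevant chunks in the corpus, gives precision 3/5 = 0.6 and recall 3/4 = 0.75.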
Self-RAG
Self-RAG is an advanced RAG framework where the language model learns to decide when to retrieve, evaluate the relevance of retrieved passages, and assess the quality and groundedness of its own generated responses.
Semantic Chunking
Semantic chunking splits documents into segments based on meaning boundaries—grouping sentences that discuss the same topic together—rather than fixed character counts. This produces more coherent, self-contained chunks that improve retrieval quality.
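A minimal sketch of one common approach: embed each sentence and start a new chunk wherever similarity between adjacent sentences drops. Here embed is a stand-in for a sentence-embedding call (an assumption), and the 0.7 threshold is illustrative:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    vecs = np.array([embed(s) for s in sentences])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit vectors: dot = cosine
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(vecs[i - 1] @ vecs[i]) < threshold:      # similarity drop = topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```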
Semantic Similarity
Semantic similarity is a measure of how alike two pieces of text are in meaning, regardless of the exact words used, computed by comparing their embedding vectors using metrics such as cosine similarity.
Sentence Window Retrieval
Sentence window retrieval indexes individual sentences for high-precision embedding and retrieval, but expands each retrieved sentence to include a window of surrounding sentences before passing to the LLM, providing both precision and context.
Sliding Window Chunking
Sliding window chunking splits documents into overlapping segments by advancing a fixed-size window across the text. Overlap between consecutive chunks ensures that information near chunk boundaries is captured in multiple chunks, reducing information loss.
Sparse Retrieval
Sparse retrieval is a search approach based on exact or weighted keyword matching, where documents and queries are represented as high-dimensional sparse vectors with most values being zero, and similarity is measured by term overlap.
Step-Back Prompting
Step-back prompting is a RAG technique that reformulates a specific, narrow query into a more general question before retrieval, improving recall for queries where the exact answer lives in higher-level conceptual documents.
Text Embedding
A text embedding is a numerical vector representation of text that encodes its semantic meaning, enabling mathematical comparison of text similarity. Text embeddings are the foundation of semantic search and RAG retrieval.
TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical weighting scheme that scores how important a term is to a specific document relative to a collection, used in keyword search and as the conceptual foundation for BM25.
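In its basic form, where tf(t, d) is how often term t occurs in document d, N is the number of documents, and df(t) is the number of documents containing t (many variants add smoothing or log-scale the term frequency):

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
```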
Token Budget
A token budget is the maximum number of tokens allocated to different sections of an LLM prompt in a RAG system—system instructions, retrieved context, and conversation history—ensuring the total stays within the model's context window limit.
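A budgeting sketch using tiktoken for counting; the window size and reservation are illustrative. Fixed sections are reserved first, and retrieved chunks consume whatever remains, in rank order:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fit_context(chunks: list[str], context_window: int = 8192,
                reserved: int = 1500) -> list[str]:
    # `reserved` covers system instructions, history, and room for the model's output.
    budget = context_window - reserved
    kept = []
    for chunk in chunks:                 # chunks arrive in relevance order
        cost = len(enc.encode(chunk))
        if cost > budget:
            break
        kept.append(chunk)
        budget -= cost
    return kept
```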
Vector Upsert
An upsert (update + insert) in vector databases writes a vector and its metadata, inserting it if the ID does not exist or replacing it if the ID already exists, enabling efficient knowledge base updates without full re-indexing.
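Upsert semantics in miniature, using a plain dict as a stand-in for the vector store:

```python
store: dict[str, dict] = {}

def upsert(doc_id: str, vector: list[float], metadata: dict) -> None:
    # One call handles both cases: insert if the ID is new, replace if it exists.
    store[doc_id] = {"vector": vector, "metadata": metadata}

upsert("doc-42", [0.1, 0.7, 0.2], {"version": 1})
upsert("doc-42", [0.3, 0.5, 0.4], {"version": 2})   # replaces, never duplicates
assert len(store) == 1
```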
Vector Database
A vector database is a purpose-built data store optimized for storing, indexing, and querying high-dimensional numerical vectors (embeddings), enabling fast similarity search across large collections of embedded documents.
Vector Quantization
Vector quantization compresses high-dimensional embedding vectors into smaller representations using techniques like product quantization or scalar quantization, reducing vector database storage costs and improving query throughput at the cost of some retrieval accuracy.
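A scalar-quantization sketch in NumPy: float32 values are mapped to int8 with a per-vector scale, cutting storage four-fold at the cost of some precision:

```python
import numpy as np

def quantize(v: np.ndarray) -> tuple[np.ndarray, float]:
    scale = max(float(np.abs(v).max()) / 127.0, 1e-12)  # map the largest value to +/-127
    return np.round(v / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                 # approximate reconstruction
```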
Weaviate
Weaviate is an open-source vector database with built-in support for hybrid search, multi-tenancy, and automatic vectorization, popular in enterprise RAG deployments for its flexibility and self-hosting capability.