Retrieval-Augmented Generation (RAG)

RAG Pipeline

Definition

A RAG pipeline orchestrates two main workflows: the offline indexing pipeline and the online query pipeline. The indexing pipeline runs during knowledge base setup and updates: documents are loaded from sources, chunked into segments, embedded into vectors, and stored in a vector database with metadata. The query pipeline runs at inference time: a user query is embedded, the vector database is searched for relevant chunks, retrieved chunks are assembled into a context prompt, and a language model generates a grounded answer. These two pipelines must be designed together—decisions in indexing (chunk size, metadata schema) directly constrain what is possible in querying.

Why It Matters

The RAG pipeline is the production unit of deployment for AI knowledge systems. Individual components like vector databases, embedding models, and LLMs each have their own documentation and APIs, but it is the pipeline that determines how well they work together. For 99helpers customers, the RAG pipeline is the architecture that connects their knowledge base content to their chatbot's answers. Well-designed pipelines handle failures gracefully (retrieval returning no results, the LLM refusing to answer), scale to high query volumes, and include observability instrumentation to track performance metrics at each stage.

How It Works

Modern RAG pipeline frameworks like LlamaIndex and LangChain provide high-level abstractions for each stage. A typical LlamaIndex pipeline: (1) SimpleDirectoryReader loads documents; (2) SentenceSplitter chunks them; (3) OpenAIEmbedding embeds chunks; (4) VectorStoreIndex stores vectors; (5) VectorIndexRetriever retrieves top-K chunks; (6) ResponseSynthesizer passes context to an LLM; (7) QueryEngine ties retrieval and synthesis together. Each component is swappable, allowing teams to upgrade individual stages (e.g., switch embedding models) without rewriting the pipeline. Production pipelines add monitoring, retry logic, caching, and evaluation hooks.
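The swappability described above comes from treating each stage as an interchangeable component behind a stable interface. The sketch below uses a simple callable-based design to show the idea; it is not the actual LlamaIndex API, and the stage names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class RAGIndexer:
    # Each stage is a plain callable, so any one can be replaced independently.
    load: Callable[[], List[str]]
    chunk: Callable[[str], List[str]]
    embed: Callable[[str], List[float]]

    def index(self) -> List[Tuple[str, List[float]]]:
        store = []
        for doc in self.load():
            for c in self.chunk(doc):
                store.append((c, self.embed(c)))
        return store

pipe = RAGIndexer(
    load=lambda: ["alpha beta gamma delta"],
    chunk=lambda d: [d[i:i + 10] for i in range(0, len(d), 10)],
    embed=lambda c: [float(len(c))],  # placeholder for a real embedding model
)
store = pipe.index()
```

Upgrading the embedding model here means replacing one field of the dataclass; the load, chunk, and storage stages are untouched, which is the property that makes framework pipelines maintainable in production.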

RAG Pipeline — Indexing and Query Phases

Indexing phase (offline — run once or on update):
Documents (PDFs, URLs, text) → Loader (parse & extract) → Chunker (split by size) → Embedder (model → vectors) → Vector Store (pgvector / Pinecone)

Query phase (online — runs per user request):
User Query (natural language) → Embed Query (same model) → Vector Search (ANN similarity) → Top-k Chunks (retrieved docs) → Prompt Builder (query + context) → LLM (generate) → Answer (grounded output)

Real-World Example

A 99helpers customer builds their AI support chatbot as a RAG pipeline: (1) data connectors pull from Zendesk and Confluence; (2) RecursiveCharacterTextSplitter chunks content at 512 tokens with 50-token overlap; (3) text-embedding-3-small creates vectors; (4) Pinecone stores them with metadata (source, category, last-updated); (5) at query time, the user question is embedded and top-5 chunks retrieved with metadata filtering by category; (6) GPT-4o synthesizes the answer citing source documents. Adding a reranker between steps 5 and 6 improves precision, demonstrating the pipeline's modularity.
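Step (2) above, fixed-size chunking with overlap, can be sketched with a sliding window. Simple whitespace splitting stands in for a real tokenizer here; production splitters count model tokens, not words.

```python
def chunk_tokens(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    # Slide a window of `chunk_size` tokens, stepping by chunk_size - overlap,
    # so consecutive chunks share `overlap` tokens at their boundary.
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# 1000 tokens with 512/50 settings yields three chunks (512, 512, 76 tokens).
chunks = chunk_tokens("tok " * 1000)
```

The overlap exists so that a sentence falling across a chunk boundary still appears intact in at least one chunk, at the cost of storing some tokens twice.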

Common Mistakes

  • Treating the RAG pipeline as a one-time setup rather than a continuously maintained system requiring monitoring and updates.
  • Coupling all pipeline stages tightly, making it impossible to upgrade one component (e.g., the embedding model) without rebuilding everything.
  • Skipping evaluation pipelines—without measuring retrieval and generation quality, improvements are guesswork.
