Data Connector
Definition
Data connectors are the ingestion layer of RAG architectures, responsible for pulling content from diverse source systems and normalizing it into a unified document format. Each connector handles the specifics of a particular source: authentication (OAuth, API keys), API pagination, content extraction (parsing Confluence wiki markup, extracting Notion blocks, reading Google Docs), deduplication across sources, and incremental sync (tracking what has changed since the last ingestion run). The extracted content is then passed to the chunking and embedding pipeline. Frameworks like LlamaIndex provide 160+ pre-built data connectors (called LlamaHub readers), and LangChain offers document loaders for many common sources.
Why It Matters
Enterprises and SaaS teams rarely store all their knowledge in a single system. Support knowledge spans Confluence wiki pages, Zendesk articles, GitHub READMEs, Notion documentation, Slack threads, and PDF attachments. Without data connectors, engineering teams must write custom ingestion scripts for each source—a significant maintenance burden. For 99helpers customers who want their chatbot to answer from all their internal knowledge sources, data connectors enable a 'connect once, retrieve everywhere' model, dramatically reducing the integration effort needed to build a comprehensive knowledge base.
How It Works
A data connector typically implements a load() or get_documents() interface that returns a list of Document objects, each with a text field (the content) and a metadata field (source URL, last modified date, author, etc.). Connectors handle source-specific concerns: the Confluence connector authenticates via API token, paginates through spaces and pages, and converts wiki markup to clean text. The Google Drive connector uses OAuth to access Drive files, extracts text from Docs and PDFs, and tracks file modification dates for incremental sync. After loading, documents flow into the standard RAG pipeline: chunking, embedding, and vector database upsert.
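The load() interface described above can be sketched in a few lines. This is an illustrative sketch, not any specific framework's API: the `Document` dataclass, the `WikiConnector` class, and the stubbed fetch method are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Normalized output format shared by every connector."""
    text: str
    metadata: dict = field(default_factory=dict)


class WikiConnector:
    """Hypothetical connector for a paginated wiki API."""

    def __init__(self, base_url: str, api_token: str):
        self.base_url = base_url
        self.api_token = api_token  # used to authenticate API calls

    def _fetch_pages(self):
        # A real connector would call the wiki's REST API here,
        # following pagination cursors until exhausted. Stubbed for the sketch.
        yield {
            "body": "How to reset your password...",
            "url": "/faq/reset",
            "modified": "2025-03-01",
            "author": "support@co",
        }

    def load(self) -> list[Document]:
        return [
            Document(
                # Markup-to-clean-text conversion would happen here.
                text=page["body"],
                metadata={
                    "source": page["url"],
                    "last_modified": page["modified"],
                    "author": page["author"],
                },
            )
            for page in self._fetch_pages()
        ]


docs = WikiConnector("https://wiki.example.com", api_token="...").load()
```

Because every connector returns the same `Document` shape, the downstream chunking and embedding stages never need to know which source a document came from.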
Data Connector Ecosystem
[Diagram: data sources feed a connector pipeline that handles authentication, pagination, parsing, deduplication, and incremental sync. The pipeline emits parsed Document objects (e.g. content: "How to reset...", with source, date, and author metadata) into the RAG pipeline's chunker and embedder.]
Real-World Example
A 99helpers customer wants their AI chatbot to answer questions from three sources: their Zendesk Help Center, a Confluence wiki, and a Google Drive folder of PDF product specs. Using LlamaHub's data connectors, the team configures three connectors with appropriate credentials. Each connector fetches all documents from its source, returns normalized Document objects, and feeds them into the shared chunking and embedding pipeline. When documents update in any source, the connectors' incremental sync detects changes and re-indexes only modified content, keeping the chatbot current without full re-indexing.
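The fan-in pattern in this example can be sketched as follows. The connector classes and the `ingest` helper are placeholders for illustration; in practice the three stubs would be replaced by the corresponding LlamaHub readers or LangChain document loaders.

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)


class Connector(Protocol):
    """Common interface every connector satisfies."""
    def load(self) -> list[Document]: ...


# Stub connectors standing in for Zendesk, Confluence, and Google Drive readers.
class ZendeskConnector:
    def load(self) -> list[Document]:
        return [Document("Refund policy...", {"source": "zendesk/article/42"})]


class ConfluenceConnector:
    def load(self) -> list[Document]:
        return [Document("Deployment runbook...", {"source": "confluence/OPS/runbook"})]


class DriveConnector:
    def load(self) -> list[Document]:
        return [Document("Product spec v2...", {"source": "drive/specs/widget.pdf"})]


def ingest(connectors: Iterable[Connector]) -> list[Document]:
    """Shared pipeline entry point: every source funnels into one list,
    which then flows through chunking, embedding, and vector upsert."""
    docs: list[Document] = []
    for connector in connectors:
        docs.extend(connector.load())
    return docs


corpus = ingest([ZendeskConnector(), ConfluenceConnector(), DriveConnector()])
# corpus now holds uniformly shaped Documents from all three sources.
```

The key design point is that `ingest` depends only on the shared `load()` interface, so adding a fourth source means writing (or installing) one more connector, not touching the pipeline.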
Common Mistakes
- ✕ Building custom connectors for sources that already have open-source connector libraries, duplicating significant engineering effort.
- ✕ Ignoring incremental sync—re-indexing entire sources on every run is expensive and introduces lag for large knowledge bases.
- ✕ Storing connector credentials in code instead of using environment variables or a secrets manager.
Related Terms
Document Loader
A document loader is a component that reads raw files from a file system, URL, or API and converts them into a standardized Document object with text content and metadata, serving as the first step in a RAG ingestion pipeline.
Document Ingestion
Document ingestion is the process of importing, parsing, and indexing external documents — PDFs, Word files, web pages, CSVs, and more — into a knowledge base or AI retrieval system. It transforms raw files into searchable, retrievable content that an AI can use to answer questions.
Indexing Pipeline
An indexing pipeline is the offline data processing workflow that transforms raw documents into searchable vector embeddings, running during knowledge base setup and when content is updated.
Retrieval Pipeline
A retrieval pipeline is the online query-time workflow that transforms a user question into a ranked set of relevant document chunks, serving as the information retrieval stage of a RAG system.
PDF Ingestion
PDF ingestion is the process of extracting text from PDF files and indexing them into a knowledge base. PDFs are the most common document format for product manuals, policies, and technical guides — but extracting clean, structured text from them requires specialized parsing to handle layouts, fonts, columns, and embedded images.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →