Retrieval-Augmented Generation (RAG)

Data Connector

Definition

Data connectors are the ingestion layer of RAG architectures, responsible for pulling content from diverse source systems and normalizing it into a unified document format. Each connector handles the specifics of a particular source: authentication (OAuth, API keys), API pagination, content extraction (parsing Confluence wiki markup, extracting Notion blocks, reading Google Docs), deduplication across sources, and incremental sync (tracking what has changed since the last ingestion run). The extracted content is then passed to the chunking and embedding pipeline. Frameworks like LlamaIndex provide 160+ pre-built data connectors (called LlamaHub readers), and LangChain offers document loaders for many common sources.
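The "unified document format" can be sketched as a small dataclass. This is an illustrative shape only, not the actual Document class from LlamaIndex or LangChain:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Normalized output of any data connector."""
    text: str                                     # extracted plain-text content
    metadata: dict = field(default_factory=dict)  # source URL, author, dates, ...

# Whatever the source (Confluence page, Notion block, PDF), every connector
# emits the same shape, so downstream chunking/embedding stays source-agnostic.
doc = Document(
    text="How to reset your password...",
    metadata={"source": "notion/faq", "last_modified": "2025-03-01"},
)
```

Because every connector emits this one shape, the chunking and embedding pipeline never needs source-specific logic.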

Why It Matters

Enterprises and SaaS teams rarely store all their knowledge in a single system. Support knowledge spans Confluence wiki pages, Zendesk articles, GitHub READMEs, Notion documentation, Slack threads, and PDF attachments. Without data connectors, engineering teams must write custom ingestion scripts for each source—a significant maintenance burden. For 99helpers customers who want their chatbot to answer from all their internal knowledge sources, data connectors enable a 'connect once, retrieve everywhere' model, dramatically reducing the integration effort needed to build a comprehensive knowledge base.

How It Works

A data connector typically implements a load() or get_documents() interface that returns a list of Document objects, each with a text field (the content) and a metadata field (source URL, last modified date, author, etc.). Connectors handle source-specific concerns: the Confluence connector authenticates via API token, paginates through spaces and pages, and converts wiki markup to clean text. The Google Drive connector uses OAuth to access Drive files, extracts text from Docs and PDFs, and tracks file modification dates for incremental sync. After loading, documents flow into the standard RAG pipeline: chunking, embedding, and vector database upsert.
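A minimal sketch of that load() interface, with a toy in-memory source standing in for a paginated API (the class and method names here are hypothetical, not from any specific framework):

```python
from abc import ABC, abstractmethod

class BaseConnector(ABC):
    """Hypothetical connector interface: each source implements load()."""

    @abstractmethod
    def load(self) -> list[dict]:
        """Return normalized documents: {'text': ..., 'metadata': {...}}."""

class InMemoryConnector(BaseConnector):
    """Toy connector that paginates over an in-memory source."""

    def __init__(self, pages, page_size=2):
        self.pages = pages          # list of (title, body) pairs
        self.page_size = page_size

    def _fetch_page(self, offset):
        # A real connector would call a paginated HTTP API here,
        # handling auth tokens and rate limits.
        return self.pages[offset:offset + self.page_size]

    def load(self):
        docs, offset = [], 0
        while True:
            batch = self._fetch_page(offset)
            if not batch:
                break
            for title, body in batch:
                docs.append({"text": body, "metadata": {"source": title}})
            offset += self.page_size
        return docs

connector = InMemoryConnector(
    [("faq", "How to reset..."), ("guide", "Setup steps..."), ("notes", "Internal...")]
)
documents = connector.load()  # three normalized documents
```

Swapping Confluence for Google Drive means swapping the connector class; the load() contract, and everything downstream of it, stays the same.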

Data Connector Ecosystem

Data Sources

  • Notion (docs)
  • Google Drive (files)
  • Confluence (wiki)
  • GitHub (code)
  • Slack (messages)
  • PostgreSQL (database)
  • PDF files (binary)

Connector Pipeline

Each connector handles three stages:

  • Extract: authentication, pagination, rate limits
  • Transform: format normalization into plain text
  • Load: structured Document objects plus metadata

Parsed Document

A loaded document carries the extracted content plus provenance metadata, for example:

  content: "How to reset..."
  source: "notion/faq"
  date: "2025-03-01"
  author: "support@co"

RAG Pipeline

Parsed documents are then handed off to the RAG pipeline: Chunker → Embedder
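The hand-off into chunking can be sketched with a fixed-size character splitter (a simplification; production pipelines typically use token-aware or sentence-aware splitters):

```python
def chunk(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    """Split text into fixed-size character chunks with overlap,
    so context spanning a chunk boundary is not lost."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

pieces = chunk("How to reset your password in three easy steps", size=20, overlap=5)
```

Each chunk would then be embedded and upserted into the vector database, with the parent document's metadata attached for attribution.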

Real-World Example

A 99helpers customer wants their AI chatbot to answer questions from three sources: their Zendesk Help Center, a Confluence wiki, and a Google Drive folder of PDF product specs. Using LlamaHub's data connectors, the team configures three connectors with appropriate credentials. Each connector fetches all documents from its source, returns normalized Document objects, and feeds them into the shared chunking and embedding pipeline. When documents update in any source, the connectors' incremental sync detects changes and re-indexes only modified content, keeping the chatbot current without full re-indexing.
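The incremental sync step can be sketched as a comparison of last-modified timestamps against a stored index (a simplified scheme; real connectors may instead use content hashes or source change feeds):

```python
def incremental_sync(source_docs, last_modified_index):
    """Decide which documents need (re-)indexing.

    source_docs: {doc_id: (last_modified, text)} snapshot from a connector.
    last_modified_index: {doc_id: last_modified} saved from the previous run.
    Returns (doc_ids to re-index, updated index for the next run).
    """
    to_index = [
        doc_id for doc_id, (modified, _) in source_docs.items()
        if last_modified_index.get(doc_id) != modified
    ]
    new_index = {doc_id: modified for doc_id, (modified, _) in source_docs.items()}
    return to_index, new_index

# First run: no stored index, so everything is new.
docs = {"faq": ("2025-03-01", "How to reset..."),
        "spec": ("2025-02-10", "API spec...")}
changed, index = incremental_sync(docs, {})

# Second run: only the updated FAQ page is re-indexed.
docs["faq"] = ("2025-03-05", "How to reset (updated)...")
changed2, _ = incremental_sync(docs, index)
```

Only the changed documents flow back through chunking and embedding, which is what keeps large knowledge bases current without full re-indexing.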

Common Mistakes

  • Building custom connectors for sources that already have open-source connector libraries, duplicating significant engineering effort.
  • Ignoring incremental sync—re-indexing entire sources on every run is expensive and introduces lag for large knowledge bases.
  • Storing connector credentials in code instead of using environment variables or a secrets manager.
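The credentials point can be illustrated with a small helper that reads a token from the environment rather than from source code (CONFLUENCE_API_TOKEN is a hypothetical variable name):

```python
import os

def get_confluence_token() -> str:
    """Read the connector credential from the environment, never from code."""
    token = os.environ.get("CONFLUENCE_API_TOKEN")
    if not token:
        raise RuntimeError(
            "CONFLUENCE_API_TOKEN is not set; configure it via your "
            "environment or a secrets manager, not in source control."
        )
    return token

os.environ["CONFLUENCE_API_TOKEN"] = "example-token"  # for demonstration only
token = get_confluence_token()
```

Failing fast with a clear error when the variable is missing beats a cryptic authentication failure deep inside the connector.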
