Data Connector
Definition
Data connectors are the ingestion layer of RAG architectures, responsible for pulling content from diverse source systems and normalizing it into a unified document format. Each connector handles the specifics of a particular source: authentication (OAuth, API keys), API pagination, content extraction (parsing Confluence wiki markup, extracting Notion blocks, reading Google Docs), deduplication across sources, and incremental sync (tracking what has changed since the last ingestion run). The extracted content is then passed to the chunking and embedding pipeline. Frameworks like LlamaIndex provide 160+ pre-built data connectors (called LlamaHub readers), and LangChain offers document loaders for many common sources.
Why It Matters
Enterprises and SaaS teams rarely store all their knowledge in a single system. Support knowledge spans Confluence wiki pages, Zendesk articles, GitHub READMEs, Notion documentation, Slack threads, and PDF attachments. Without data connectors, engineering teams must write custom ingestion scripts for each source—a significant maintenance burden. For 99helpers customers who want their chatbot to answer from all their internal knowledge sources, data connectors enable a 'connect once, retrieve everywhere' model, dramatically reducing the integration effort needed to build a comprehensive knowledge base.
How It Works
A data connector typically implements a load() or get_documents() interface that returns a list of Document objects, each with a text field (the content) and a metadata field (source URL, last modified date, author, etc.). Connectors handle source-specific concerns: the Confluence connector authenticates via API token, paginates through spaces and pages, and converts wiki markup to clean text. The Google Drive connector uses OAuth to access Drive files, extracts text from Docs and PDFs, and tracks file modification dates for incremental sync. After loading, documents flow into the standard RAG pipeline: chunking, embedding, and vector database upsert.
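The load() interface described above can be sketched in a few lines. This is an illustrative sketch, not any specific framework's API: the `Document` dataclass, the `WikiConnector` class, and the stubbed fetch method are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Normalized output format shared by every connector."""
    text: str
    metadata: dict = field(default_factory=dict)


class WikiConnector:
    """Hypothetical connector for a paginated wiki API."""

    def __init__(self, base_url: str, api_token: str):
        self.base_url = base_url
        self.api_token = api_token  # used to authenticate API calls

    def _fetch_pages(self):
        # A real connector would call the wiki's REST API here,
        # following pagination cursors until exhausted. Stubbed for the sketch.
        yield {
            "body": "How to reset your password...",
            "url": "/faq/reset",
            "modified": "2025-03-01",
            "author": "support@co",
        }

    def load(self) -> list[Document]:
        return [
            Document(
                # Markup-to-clean-text conversion would happen here.
                text=page["body"],
                metadata={
                    "source": page["url"],
                    "last_modified": page["modified"],
                    "author": page["author"],
                },
            )
            for page in self._fetch_pages()
        ]


docs = WikiConnector("https://wiki.example.com", api_token="...").load()
```

Because every connector returns the same `Document` shape, the downstream chunking and embedding stages never need to know which source a document came from.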
Data Connector Ecosystem
[Diagram: data sources feed a connector pipeline that handles authentication, pagination, parsing, deduplication, and incremental sync. The pipeline emits parsed Document objects (e.g. content: "How to reset...", with source, date, and author metadata) into the RAG pipeline's chunker and embedder.]
Real-World Example
A 99helpers customer wants their AI chatbot to answer questions from three sources: their Zendesk Help Center, a Confluence wiki, and a Google Drive folder of PDF product specs. Using LlamaHub's data connectors, the team configures three connectors with appropriate credentials. Each connector fetches all documents from its source, returns normalized Document objects, and feeds them into the shared chunking and embedding pipeline. When documents update in any source, the connectors' incremental sync detects changes and re-indexes only modified content, keeping the chatbot current without full re-indexing.
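The fan-in pattern in this example can be sketched as follows. The connector classes and the `ingest` helper are placeholders for illustration; in practice the three stubs would be replaced by the corresponding LlamaHub readers or LangChain document loaders.

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)


class Connector(Protocol):
    """Common interface every connector satisfies."""
    def load(self) -> list[Document]: ...


# Stub connectors standing in for Zendesk, Confluence, and Google Drive readers.
class ZendeskConnector:
    def load(self) -> list[Document]:
        return [Document("Refund policy...", {"source": "zendesk/article/42"})]


class ConfluenceConnector:
    def load(self) -> list[Document]:
        return [Document("Deployment runbook...", {"source": "confluence/OPS/runbook"})]


class DriveConnector:
    def load(self) -> list[Document]:
        return [Document("Product spec v2...", {"source": "drive/specs/widget.pdf"})]


def ingest(connectors: Iterable[Connector]) -> list[Document]:
    """Shared pipeline entry point: every source funnels into one list,
    which then flows through chunking, embedding, and vector upsert."""
    docs: list[Document] = []
    for connector in connectors:
        docs.extend(connector.load())
    return docs


corpus = ingest([ZendeskConnector(), ConfluenceConnector(), DriveConnector()])
# corpus now holds uniformly shaped Documents from all three sources.
```

The key design point is that `ingest` depends only on the shared `load()` interface, so adding a fourth source means writing (or installing) one more connector, not touching the pipeline.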
Common Mistakes
- ✕ Building custom connectors for sources that already have open-source connector libraries, duplicating significant engineering effort.
- ✕ Ignoring incremental sync—re-indexing entire sources on every run is expensive and introduces lag for large knowledge bases.
- ✕ Storing connector credentials in code instead of using environment variables or a secrets manager.
Related Terms
Document Loader
A document loader is a component that reads raw files from a file system, URL, or API and converts them into a standardized Document object with text content and metadata, serving as the first step in a RAG ingestion pipeline.
Document Ingestion
Document ingestion is the process of importing, parsing, and indexing external documents — PDFs, Word files, web pages, CSVs, and more — into a knowledge base or AI retrieval system. It transforms raw files into searchable, retrievable content that an AI can use to answer questions.
Indexing Pipeline
An indexing pipeline is the offline data processing workflow that transforms raw documents into searchable vector embeddings, running during knowledge base setup and when content is updated.
Retrieval Pipeline
A retrieval pipeline is the online query-time workflow that transforms a user question into a ranked set of relevant document chunks, serving as the information retrieval stage of a RAG system.
PDF Ingestion
PDF ingestion is the process of extracting text from PDF files and indexing them into a knowledge base. PDFs are the most common document format for product manuals, policies, and technical guides — but extracting clean, structured text from them requires specialized parsing to handle layouts, fonts, columns, and embedded images.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →