Document Loader
Definition
Document loaders are the file-reading layer in RAG frameworks, responsible for opening and parsing individual files or API responses and producing a normalized Document representation. Unlike data connectors that handle authentication and pagination with external services, document loaders focus on parsing specific file formats: PDFs (using pdfplumber, PyMuPDF, or pdfminer), Word documents (python-docx), HTML pages (BeautifulSoup), Markdown files, CSV and Excel spreadsheets, PowerPoint slides, and plain text. Each loader handles the format-specific extraction logic and produces a Document object containing the extracted text and metadata such as source path, page number, and creation date.
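The normalized Document representation described above can be pictured as a small container pairing extracted text with metadata. A minimal sketch (the field names here are illustrative, not any particular framework's exact schema):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Normalized output of a document loader: extracted text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)  # e.g. source path, page number, date

# One loaded PDF page becomes one Document
doc = Document(
    page_content="Our Q1 revenue grew by...",
    metadata={"source": "report.pdf", "page": 1, "date": "2025-03-01"},
)
```

Every downstream stage (chunking, embedding, retrieval) operates on this uniform shape, which is why loaders for wildly different formats can all feed the same pipeline.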
Why It Matters
The quality of document loading directly determines what information enters the RAG pipeline. A poor PDF loader that ignores headers, misreads tables, or fails on scanned pages will produce garbled text that embeds poorly and retrieves incorrectly. For 99helpers customers ingesting technical documentation, API references, and product specification PDFs, choosing the right loader and parsing strategy is as important as the retrieval and generation steps. Investing in high-quality loading—including table extraction, header preservation, and layout-aware parsing—pays dividends in answer quality throughout the entire pipeline.
How It Works
Document loaders in LangChain implement a load() method returning a list of Document objects. For PDFs, PyPDFLoader reads each page, extracts text with the pypdf library, and returns one Document per page with metadata including the page number and source filename. For HTML, WebBaseLoader fetches a URL and uses BeautifulSoup to extract readable text, stripping navigation, ads, and scripts. For CSV files, a CSVLoader reads each row as a separate Document, treating column headers as metadata keys. After loading, documents pass to the chunker and embedder. Choosing the right loader for each file type, and handling loading errors gracefully, is a critical part of production pipeline reliability.
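The CSV behavior described above can be sketched with a toy loader that mirrors the load() interface. This is an illustrative stdlib-only implementation, not LangChain's actual CSVLoader; the class and parameter names are assumptions for the example:

```python
import csv
import io
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

class SimpleCSVLoader:
    """Toy loader mirroring the load() pattern: one Document per CSV row,
    with column headers carried into the metadata as keys."""

    def __init__(self, text: str, source: str):
        self.text = text
        self.source = source

    def load(self) -> list[Document]:
        reader = csv.DictReader(io.StringIO(self.text))
        docs = []
        for i, row in enumerate(reader):
            # Render the row as "header: value" lines for embedding-friendly text
            content = "\n".join(f"{k}: {v}" for k, v in row.items())
            docs.append(Document(content, {"source": self.source, "row": i, **row}))
        return docs

docs = SimpleCSVLoader("name,limit\nupload,25MB\n", "specs.csv").load()
```

A real loader would open a file path rather than an in-memory string, but the shape is the same: load() hides the format-specific parsing and emits uniform Document objects.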
Document Loading Pipeline
Source files (PDF, DOCX, HTML, CSV, JSON) flow into the Document Loader, which performs three steps:
- Format parsing: PDF → text, HTML → stripped, CSV → rows
- Metadata extraction: source, page number, date, author
- Normalization: encoding fixes, whitespace cleanup, dedupe
The output is a set of parsed Document objects, each pairing content with metadata, for example:
- content: "Our Q1 revenue grew by...", source: "report.pdf", page: 1, date: "2025-03-01"
- content: "Product adoption metrics show...", source: "report.pdf", page: 2, date: "2025-03-01"
From there, the Chunker splits the text, the Embedder produces dense vectors, and the Vector Store indexes them.
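The normalization step in the pipeline above can be sketched as a small post-processing pass. This is an illustrative stdlib-only function, not tied to any framework:

```python
import unicodedata

def normalize(texts: list[str]) -> list[str]:
    """Normalization pass: fix encoding artifacts, collapse whitespace, dedupe."""
    seen = set()
    out = []
    for t in texts:
        t = unicodedata.normalize("NFKC", t)  # unify Unicode forms (e.g. NBSP -> space)
        t = " ".join(t.split())               # collapse runs of whitespace and newlines
        if t and t not in seen:               # drop empty pages and exact duplicates
            seen.add(t)
            out.append(t)
    return out
```

Production pipelines usually go further (near-duplicate detection, boilerplate stripping), but even this minimal pass prevents duplicate pages from polluting the vector store.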
Real-World Example
A 99helpers customer uploads 200 PDF product manuals to their knowledge base. Using LangChain's UnstructuredPDFLoader, which combines pdfminer for text extraction with computer vision for table and header detection, the pipeline extracts clean, well-structured text including table contents as formatted strings. When users ask questions about specifications in tables (e.g., 'What is the maximum file size for upload?'), the loader's table extraction ensures this data is available in the embedding. A simpler loader that treated tables as blank regions would miss these specifications entirely.
Common Mistakes
- ✕ Using a basic text-extraction PDF loader for scanned or image-heavy PDFs—OCR-capable loaders like Unstructured are needed for these.
- ✕ Loading large files as single documents without chunking—a 200-page PDF loaded as one document exceeds embedding model token limits.
- ✕ Ignoring loader errors silently—a single corrupt file that crashes the loader can halt the entire ingestion run.
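The last mistake has a simple remedy: isolate each file load so one corrupt file is logged and skipped rather than crashing the run. A minimal sketch (the function names are illustrative):

```python
import logging

logger = logging.getLogger("ingestion")

def load_all(paths, load_one):
    """Load each file independently so one corrupt file cannot halt the run.

    `load_one` is whatever loader function applies to the file type; it is
    expected to return a list of document texts for a given path.
    """
    docs, failures = [], []
    for path in paths:
        try:
            docs.extend(load_one(path))
        except Exception as exc:  # isolate per-file errors; report, don't crash
            logger.warning("Skipping %s: %s", path, exc)
            failures.append((path, str(exc)))
    return docs, failures
```

Returning the failure list alongside the documents lets the ingestion job surface a summary ("198 of 200 files loaded") instead of failing silently or halting entirely.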
Related Terms
Data Connector
A data connector in RAG systems is an integration component that ingests content from a specific external source—such as Confluence, Notion, Google Drive, or Zendesk—and transforms it into a format suitable for embedding and storage in a vector database.
Document Ingestion
Document ingestion is the process of importing, parsing, and indexing external documents — PDFs, Word files, web pages, CSVs, and more — into a knowledge base or AI retrieval system. It transforms raw files into searchable, retrievable content that an AI can use to answer questions.
Document Parsing
Document parsing is the extraction of structured or clean text content from various file formats — PDF, DOCX, HTML, CSV, PPTX, and more — as part of a knowledge base ingestion pipeline. A robust parser handles format-specific complexities and produces clean, well-structured text ready for chunking and indexing.
PDF Ingestion
PDF ingestion is the process of extracting text from PDF files and indexing them into a knowledge base. PDFs are the most common document format for product manuals, policies, and technical guides — but extracting clean, structured text from them requires specialized parsing to handle layouts, fonts, columns, and embedded images.
Indexing Pipeline
An indexing pipeline is the offline data processing workflow that transforms raw documents into searchable vector embeddings, running during knowledge base setup and when content is updated.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →