Knowledge Base & Content Management

Text Chunking

Definition

Text chunking is a critical preprocessing step in building an AI knowledge base. When documents are ingested, they must be divided into segments that are small enough for precise retrieval but large enough to contain meaningful context. A chunk is the unit of retrieval — when the AI searches the knowledge base, it retrieves the most relevant chunks, not entire documents. Common chunking strategies include: fixed-size chunking (split every N tokens), sentence-based chunking (split at sentence boundaries), paragraph-based chunking (split at paragraph boundaries), and semantic chunking (split when the topic shifts). Overlap between adjacent chunks helps preserve context at boundaries.
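As one illustration of the strategies listed above, sentence-based chunking might look like the following minimal sketch. The regular expression is a naive stand-in for a real sentence tokenizer, and the function names and the 500-character limit are assumptions made for this example, not part of any particular library.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    # Abbreviations such as "e.g." will be split incorrectly.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of at most `max_chars` characters."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence  # an over-long single sentence becomes its own chunk
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because a chunk only ever closes between sentences, no sentence is cut in half. The trade-off is that chunk lengths vary with the text.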

Why It Matters

Chunking strategy is one of the highest-impact decisions in building a RAG (retrieval-augmented generation) system. The wrong chunking approach — either too coarse or too fine — degrades retrieval accuracy and AI answer quality regardless of how good the underlying model is. A chunk that is too small may lack the context needed to answer a question fully. A chunk that is too large returns irrelevant content alongside the relevant portion, confusing the model.

How It Works

During document ingestion, the raw text is passed through a chunking function that applies the chosen strategy. Fixed-size chunking uses a sliding window of N tokens (or N characters in simpler implementations) with K tokens of overlap between adjacent chunks, so the window advances N - K tokens at each step. Sentence and paragraph chunking use sentence tokenizers and paragraph-break detection to identify natural text boundaries. Semantic chunking uses embedding similarity to detect topic shifts. The resulting chunks are stored as individual units in the vector database, each with a reference back to the source document and its position within it.
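The sliding window described above can be sketched as follows. This is a minimal illustration, not a production tokenizer: whitespace-separated words stand in for model tokens, and the `Chunk` class, `fixed_size_chunks` function, and `source_id` field are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source_id: str  # reference back to the source document
    position: int   # index of this chunk within the document
    text: str


def fixed_size_chunks(text: str, source_id: str,
                      size: int = 512, overlap: int = 64) -> list[Chunk]:
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = size - overlap
    chunks = []
    for position, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + size]
        chunks.append(Chunk(source_id, position, " ".join(window)))
        if start + size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks
```

Each chunk carries its source and position, so a retrieved chunk can be traced back to the place it came from. In a real pipeline the whitespace split would likely be replaced by the tokenizer of the embedding model in use.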

Text Chunking Strategies Compared

Fixed-Size

Method: split every N characters
Example: Chunk 1 (500 chars), Chunk 2 (500 chars), Chunk 3 (may cut mid-sent...
Result: may break sentences

Sentence-Based

Method: split at sentence boundaries
Example: Complete sentence 1. Complete sentence 2. Complete sentence 3.
Result: clean boundaries

Semantic / Section

Method: split by paragraphs or headings
Example: H2: Getting Started (full para), H2: Configuration (full para), H2: Troubleshooting
Result: preserves meaning
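The comparison above shows the structural form of semantic chunking. The topic-shift form, which relies on embedding similarity, might be sketched as below. The `embed` function is a deliberately crude stand-in that counts words; a real system would call an embedding model there. The 0.2 threshold and all function names are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in embedding: lowercase word counts, not a learned vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk when a sentence is dissimilar to the one before it."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) < threshold:
            groups.append([current])    # similarity dropped: treat as a topic shift
        else:
            groups[-1].append(current)  # same topic: extend the current chunk
    return [" ".join(group) for group in groups]
```

Comparing only adjacent sentences keeps the sketch short. Comparing each sentence against the running chunk instead is a common variation, and the threshold usually needs tuning per corpus.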

Real-World Example

A knowledge base team experiments with chunk sizes for their technical documentation. With 1024-token chunks, the AI retrieves relevant sections but the answers contain irrelevant detail from surrounding content. With 256-token chunks, answers miss context. They settle on 512 tokens with 64-token overlap: precise enough for good retrieval, large enough for complete answers. Retrieval precision improves by 30%.
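To see what the overlap costs in storage, the number of chunks a sliding window produces can be computed directly. The sketch below assumes the window advances by `size - overlap` tokens. The 10,000-token document is a hypothetical figure chosen for the arithmetic, not a number from the example above.

```python
import math


def chunk_count(total_tokens: int, size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover `total_tokens`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= size:
        return 1
    stride = size - overlap
    # One chunk covers the first `size` tokens; each further chunk adds `stride`.
    return math.ceil((total_tokens - size) / stride) + 1


# Hypothetical 10,000-token document.
with_overlap = chunk_count(10_000, size=512, overlap=64)    # 23 chunks
without_overlap = chunk_count(10_000, size=512, overlap=0)  # 20 chunks
```

With these settings the 64-token overlap adds three chunks to index for this document, in exchange for context preserved at every boundary.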

Common Mistakes

✕ Using a single fixed chunk size for all content types. Short FAQ answers benefit from small chunks, while long procedural guides benefit from larger ones.
✕ Ignoring chunk overlap. Without overlap, a sentence or concept that straddles a chunk boundary is split across two chunks and appears whole in neither, losing coherence.
✕ Not validating chunk quality after processing. Inspect sample chunks to ensure they contain coherent, meaningful content rather than fragments.
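Part of the inspection recommended above can be automated with simple heuristics. The sketch below flags chunks that look like fragments. The checks, the 50-character minimum, and the function names are assumptions for illustration, and they complement manual review of sample chunks rather than replace it.

```python
def chunk_issues(chunk: str, min_chars: int = 50) -> list[str]:
    """Return heuristic warnings for one chunk; an empty list means no flags."""
    issues = []
    text = chunk.strip()
    if len(text) < min_chars:
        issues.append("too short")
    if text and text[0].islower():
        issues.append("starts mid-sentence")
    if text and text[-1] not in ".!?:\"')":
        issues.append("ends mid-sentence")
    return issues


def flag_chunks(chunks: list[str], min_chars: int = 50) -> dict[int, list[str]]:
    """Map chunk index to its warnings, skipping chunks with none."""
    report = {}
    for index, chunk in enumerate(chunks):
        issues = chunk_issues(chunk, min_chars)
        if issues:
            report[index] = issues
    return report
```

A high share of flagged chunks suggests the chunk size or boundary rule does not suit that content type. Headings and list items will trigger false positives, so the flags are a prompt for review rather than a verdict.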

Related Terms

Document Ingestion

Document ingestion is the process of importing, parsing, and indexing external documents — PDFs, Word files, web pages, CSVs, and more — into a knowledge base or AI retrieval system. It transforms raw files into searchable, retrievable content that an AI can use to answer questions.

Knowledge Base

A knowledge base is a centralized repository of structured information — articles, FAQs, guides, and documentation — that an AI chatbot or support system uses to answer user questions accurately. It is the foundation of any AI-powered self-service experience, directly determining how accurate and comprehensive the bot's answers are.

Semantic Search

Semantic search finds knowledge base articles based on the meaning of a query — not just the words used. By converting both queries and documents into vector embeddings, it identifies conceptually similar content even when users use different terminology than the articles, enabling more natural and accurate information retrieval.

PDF Ingestion

PDF ingestion is the process of extracting text from PDF files and indexing it into a knowledge base. PDFs are the most common document format for product manuals, policies, and technical guides, but extracting clean, structured text from them requires specialized parsing to handle layouts, fonts, columns, and embedded images.
