Semantic Chunking
Definition
Semantic chunking is a document segmentation strategy that uses embedding similarity to detect topical boundaries in text. Instead of splitting every 512 tokens regardless of content, semantic chunking compares the embedding of each sentence with its neighbors and inserts a split when the similarity drops below a threshold, indicating a topic shift. The resulting chunks contain semantically cohesive content, making each chunk more likely to match queries about that specific topic. This contrasts with fixed-size chunking, which can cut text mid-sentence or split a single concept across two chunks, and with recursive chunking, which splits on structural markers such as paragraph breaks.
Why It Matters
The quality of chunks directly determines the quality of retrieval and generation. Fixed-size chunks are fast to implement but frequently produce incomplete thoughts or merge unrelated content, causing the retriever to surface partially relevant passages. Semantic chunking ensures that when a chunk is retrieved, it contains a complete, coherent discussion of one topic—giving the LLM a better foundation for generating accurate answers. For 99helpers knowledge bases containing long, multi-topic help articles, semantic chunking can significantly improve answer quality by preventing topic contamination between chunks.
How It Works
To implement semantic chunking, first split the document into individual sentences. Compute an embedding for each sentence, then compute the cosine similarity between consecutive sentences or between sliding windows of sentences. When similarity drops below a configurable threshold (e.g., 0.7), mark a chunk boundary. Collect the sentences between consecutive boundaries into a single chunk. Libraries like LangChain and LlamaIndex provide semantic chunking implementations. The threshold is a hyperparameter: lower values produce fewer, larger chunks, while higher values produce more, smaller chunks. Tune it against retrieval quality metrics on your specific corpus.
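The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the bag-of-words `embed` function is a toy stand-in for a real sentence-embedding model (e.g., one from sentence-transformers), and the threshold value in the usage example is illustrative.

```python
from math import sqrt


def embed(sentence):
    # Toy bag-of-words embedding. A real pipeline would call a
    # sentence-embedding model here instead.
    vec = {}
    for word in sentence.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold=0.7):
    """Group consecutive sentences; start a new chunk where the
    similarity between adjacent sentences drops below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        cur = embed(sentence)
        if cosine(prev, cur) < threshold:
            chunks.append([])  # similarity dropped: topic shift
        chunks[-1].append(sentence)
        prev = cur
    return [" ".join(chunk) for chunk in chunks]


# Usage: with the toy embedding, lexically similar sentences stay
# together and an unrelated sentence starts a new chunk.
chunks = semantic_chunks(
    [
        "install the agent with pip",
        "install the agent on each server",
        "rate limits cap requests per minute",
    ],
    threshold=0.5,
)
# The two install sentences form one chunk; the rate-limit sentence
# forms a second.
```

With a real embedding model, similarity reflects meaning rather than word overlap, so paraphrased sentences on the same topic also stay together; the boundary-detection logic is unchanged.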
[Figure: Semantic Chunking — Similarity-Based Boundary Detection. Cosine similarity between consecutive document sentences drops at topic transitions, yielding Chunk 1 "Setup" (3 sentences), Chunk 2 "Rate limiting" (2 sentences), and Chunk 3 "Error handling" (2 sentences).]
Real-World Example
A 99helpers help article covers three topics: initial setup, advanced configuration, and troubleshooting. With fixed-size chunking at 400 tokens, the article splits mid-paragraph, producing one chunk that mixes setup and configuration content and another that spans configuration and troubleshooting. With semantic chunking, the embedding similarity drops at the natural topic transitions, producing three coherent chunks. Retrieval tests show that queries about 'troubleshooting' now reliably surface only the troubleshooting chunk rather than mixed-content chunks, improving answer precision by 28%.
Common Mistakes
- ✕ Choosing a similarity threshold without testing it—default values may over-split or under-split your specific document style.
- ✕ Applying semantic chunking to short documents where it adds overhead without meaningfully improving over paragraph-based chunking.
- ✕ Ignoring chunk size distribution—semantic chunks can vary widely in length, and very long chunks may overflow the LLM context window.
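The last mistake above can be mitigated with a post-processing pass that caps chunk length. A minimal sketch, assuming whitespace word counts as a stand-in for your embedding model's tokenizer:

```python
def cap_chunk_sizes(chunks, max_tokens=512):
    """Split any chunk longer than max_tokens into fixed-size pieces.
    Whitespace splitting approximates tokenization for illustration."""
    capped = []
    for chunk in chunks:
        words = chunk.split()
        if len(words) <= max_tokens:
            capped.append(chunk)
        else:
            # Fall back to fixed-size splitting for oversized chunks.
            for i in range(0, len(words), max_tokens):
                capped.append(" ".join(words[i : i + max_tokens]))
    return capped
```

Running this after semantic chunking preserves coherent chunks that already fit, and only degrades to fixed-size splits where a single topic is simply too long for the context window.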
Related Terms
Document Chunking
Document chunking is the process of splitting large documents into smaller text segments before embedding and indexing for RAG, balancing chunk size to preserve context while staying within embedding model limits and enabling precise retrieval.
Chunk Size
Chunk size is the maximum number of tokens or characters in each document segment created during the chunking phase of RAG indexing, controlling the granularity of retrieval and the amount of context available per retrieved chunk.
Chunk Overlap
Chunk overlap is a chunking strategy where consecutive document chunks share a portion of overlapping text, ensuring that information spanning chunk boundaries is captured in at least one complete chunk.
Sliding Window Chunking
Sliding window chunking splits documents into overlapping segments by advancing a fixed-size window across the text. Overlap between consecutive chunks ensures that information near chunk boundaries is captured in multiple chunks, reducing information loss.
Recursive Chunking
Recursive chunking splits documents hierarchically using a priority list of separators—first by double newlines, then single newlines, then sentences, then words—ensuring chunks respect natural structural boundaries before falling back to finer splits.