Late Chunking
Definition
Late chunking, introduced by Jina AI, addresses a fundamental limitation of standard chunking: when documents are split before embedding, each chunk is embedded in isolation, losing the broader document context that neighboring sentences and paragraphs provide. In late chunking, the entire document (or a long passage) is first processed through a long-context embedding model that produces a contextualized token embedding for every token. The document is then split into chunks, but instead of re-embedding each chunk independently, the average of the token embeddings within each chunk's span is used as that chunk's vector. This means each chunk's embedding already incorporates global document context.
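The pooling step described above can be sketched in a few lines. This is a minimal illustration, assuming the per-token embeddings from the full-document encoder pass are already available as an array; the function name and toy data are illustrative, not from any particular library.

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool contextualized token embeddings over each chunk's span.

    token_embeddings: (num_tokens, dim) array from ONE full-document encoder pass.
    spans: [(start, end), ...] token-index ranges for each chunk (end exclusive).
    """
    return np.stack([token_embeddings[s:e].mean(axis=0) for s, e in spans])

# Toy example: 6 tokens with 4-dim embeddings, split into two chunks of 3 tokens.
tokens = np.arange(24, dtype=float).reshape(6, 4)
chunk_vecs = late_chunk(tokens, [(0, 3), (3, 6)])
print(chunk_vecs.shape)  # (2, 4): one vector per chunk
```

Because the token embeddings come from a single pass over the whole document, each pooled chunk vector already reflects context from outside its own span.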
Why It Matters
Context is crucial for disambiguation in retrieval. A chunk containing the sentence 'It supports both REST and GraphQL APIs' is almost meaningless without knowing what 'it' refers to. Standard chunk-level embedding must infer this from the chunk alone, often failing for pronoun-heavy or highly contextual documents. Late chunking leverages the full-document context pass to resolve such ambiguities, producing richer chunk embeddings. For 99helpers help documents where product names and feature references are established in introductory paragraphs, late chunking can significantly improve retrieval of mid-document chunks that would otherwise lack context.
How It Works
Late chunking requires a long-context embedding model capable of processing entire documents (e.g., models supporting 8k+ tokens). During indexing, pass the full document through the encoder to obtain per-token embeddings. Apply your chunking strategy (fixed-size, recursive, or semantic) to determine chunk boundaries. For each chunk, aggregate (mean pool) the token embeddings within its boundary span to produce the chunk's vector representation. Store these contextually enriched vectors in the vector database. At query time, retrieval proceeds normally: the query is embedded and cosine similarity is computed against chunk vectors. The query itself is embedded without document context, so query-document alignment depends on the embedding model's ability to generalize.
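The indexing and query steps above can be wired together as follows. This is a sketch of the pipeline shape only: the `token_vec` stub produces deterministic but context-free token vectors for the demo, whereas a real implementation would replace it with a long-context transformer encoder so that each token vector actually reflects the whole document. All names here are illustrative.

```python
import zlib
import numpy as np

DIM = 8  # toy embedding dimension

def token_vec(tok: str) -> np.ndarray:
    # Deterministic stand-in for a contextualized token embedding (demo only).
    rng = np.random.default_rng(zlib.crc32(tok.lower().encode()))
    return rng.standard_normal(DIM)

def embed_document(text: str, chunk_size: int = 16):
    """One 'encoder pass' over the whole document, then mean-pool per chunk span."""
    tokens = text.split()
    tok_embs = np.stack([token_vec(t) for t in tokens])
    spans = [(i, min(i + chunk_size, len(tokens)))
             for i in range(0, len(tokens), chunk_size)]
    chunks = [" ".join(tokens[s:e]) for s, e in spans]
    vecs = np.stack([tok_embs[s:e].mean(axis=0) for s, e in spans])
    return chunks, vecs

def search(query: str, chunks, vecs, k: int = 1):
    """Query time: embed the query (no document context) and rank by cosine similarity."""
    q = np.mean([token_vec(t) for t in query.split()], axis=0)
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [(chunks[i], float(sims[i])) for i in top]
```

Note that only indexing changes relative to a standard pipeline; `search` is the ordinary embed-and-compare step, which is why late chunking drops into existing vector databases without retrieval-side changes.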
Late Chunking vs Early Chunking — Contextual Embedding Quality
Early chunking: the document is split first (divided into chunks immediately), and each chunk is then embedded independently, in isolation. The resulting vectors (vA, vB, vC) capture only partial context — cross-chunk context is lost, and each embedding only knows its own chunk.
Late chunking: the entire document passes through the encoder first, then chunk boundaries are applied and the token embeddings are mean-pooled per span. The resulting vectors (vA, vB, vC) each inherit full-document context — richer representations.
Why it matters for retrieval: a chunk containing a pronoun like "it" has no context with early chunking. With late chunking, the encoder has seen the full document, so "it" is resolved to the correct entity — the embedding is semantically complete.
Real-World Example
A 99helpers API documentation page opens with: 'The Messages API allows you to create AI chatbot conversations.' Later in the document, a chunk reads: 'It accepts a model parameter specifying which LLM to use.' With standard chunking, 'It' is unresolved in the chunk embedding. With late chunking, the encoder processes the full document, so the token embedding for 'It' already encodes 'Messages API' through the encoder's self-attention. The resulting chunk vector better matches queries like 'Messages API model parameter,' improving retrieval for this chunk by 35% in internal benchmarks.
Common Mistakes
- ✕ Applying late chunking with a short-context embedding model — the model must be able to process the full document in one pass.
- ✕ Assuming late chunking eliminates the need for chunking strategy decisions — chunk boundaries still affect coherence and retrieval granularity.
- ✕ Ignoring the higher computational cost at indexing time compared to standard chunk-level embedding.
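The first mistake above is worth guarding against explicitly. A minimal sketch of such a guard, assuming an 8192-token context window purely for illustration (the real limit depends on your embedding model), with hypothetical function names:

```python
def can_late_chunk(doc_tokens: int, model_context: int = 8192) -> bool:
    """The whole document must fit in one encoder pass.
    8192 is an illustrative limit, not a universal one."""
    return doc_tokens <= model_context

def choose_strategy(doc_tokens: int, model_context: int = 8192) -> str:
    # Fall back to standard per-chunk embedding (or split the document into
    # overlapping macro-segments) when it exceeds the encoder's context window.
    if can_late_chunk(doc_tokens, model_context):
        return "late_chunking"
    return "standard_chunking"
```

Count tokens with the embedding model's own tokenizer, not by characters or words, since token counts differ across tokenizers.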
Related Terms
Document Chunking
Document chunking is the process of splitting large documents into smaller text segments before embedding and indexing for RAG, balancing chunk size to preserve context while staying within embedding model limits and enabling precise retrieval.
Embedding Model
An embedding model is a machine learning model that converts text (or other data) into dense numerical vectors that capture semantic meaning, enabling similarity search and serving as the foundation of RAG retrieval systems.
Semantic Chunking
Semantic chunking splits documents into segments based on meaning boundaries—grouping sentences that discuss the same topic together—rather than fixed character counts. This produces more coherent, self-contained chunks that improve retrieval quality.
Document Embedding
Document embedding is the process of converting text documents into numerical vector representations that capture their semantic meaning, enabling AI systems to find conceptually similar content through vector similarity search.
Context Window
A context window is the maximum amount of text (measured in tokens) that a language model can process in a single inference call, determining how much retrieved content, conversation history, and instructions can be included in a RAG prompt.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →