Late Chunking
Definition
Late chunking, introduced by Jina AI, addresses a fundamental limitation of standard chunking: when documents are split before embedding, each chunk is embedded in isolation, losing the broader document context that neighboring sentences and paragraphs provide. In late chunking, the entire document (or a long passage) is first processed through a long-context embedding model that produces a contextualized token embedding for every token. The document is then split into chunks, but instead of re-embedding each chunk independently, the average of the token embeddings within each chunk's span is used as that chunk's vector. This means each chunk's embedding already incorporates global document context.
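The pooling step described above can be sketched in a few lines. This is a minimal illustration, assuming the per-token embeddings from the full-document encoder pass are already available as an array; the function name and toy data are illustrative, not from any particular library.

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool contextualized token embeddings over each chunk's span.

    token_embeddings: (num_tokens, dim) array from ONE full-document encoder pass.
    spans: [(start, end), ...] token-index ranges for each chunk (end exclusive).
    """
    return np.stack([token_embeddings[s:e].mean(axis=0) for s, e in spans])

# Toy example: 6 tokens with 4-dim embeddings, split into two chunks of 3 tokens.
tokens = np.arange(24, dtype=float).reshape(6, 4)
chunk_vecs = late_chunk(tokens, [(0, 3), (3, 6)])
print(chunk_vecs.shape)  # (2, 4): one vector per chunk
```

Because the token embeddings come from a single pass over the whole document, each pooled chunk vector already reflects context from outside its own span.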
Why It Matters
Context is crucial for disambiguation in retrieval. A chunk containing the sentence 'It supports both REST and GraphQL APIs' is almost meaningless without knowing what 'it' refers to. Standard chunk-level embedding must infer this from the chunk alone, often failing for pronoun-heavy or highly contextual documents. Late chunking leverages the full-document context pass to resolve such ambiguities, producing richer chunk embeddings. For 99helpers help documents where product names and feature references are established in introductory paragraphs, late chunking can significantly improve retrieval of mid-document chunks that would otherwise lack context.
How It Works
Late chunking requires a long-context embedding model capable of processing entire documents (e.g., models supporting 8k+ tokens). During indexing, pass the full document through the encoder to obtain per-token embeddings. Apply your chunking strategy (fixed-size, recursive, or semantic) to determine chunk boundaries. For each chunk, aggregate (mean pool) the token embeddings within its boundary span to produce the chunk's vector representation. Store these contextually enriched vectors in the vector database. At query time, retrieval proceeds normally: the query is embedded and cosine similarity is computed against chunk vectors. The query itself is embedded without document context, so query-document alignment depends on the embedding model's ability to generalize.
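The indexing and query steps above can be wired together as follows. This is a sketch of the pipeline shape only: the `token_vec` stub produces deterministic but context-free token vectors for the demo, whereas a real implementation would replace it with a long-context transformer encoder so that each token vector actually reflects the whole document. All names here are illustrative.

```python
import zlib
import numpy as np

DIM = 8  # toy embedding dimension

def token_vec(tok: str) -> np.ndarray:
    # Deterministic stand-in for a contextualized token embedding (demo only).
    rng = np.random.default_rng(zlib.crc32(tok.lower().encode()))
    return rng.standard_normal(DIM)

def embed_document(text: str, chunk_size: int = 16):
    """One 'encoder pass' over the whole document, then mean-pool per chunk span."""
    tokens = text.split()
    tok_embs = np.stack([token_vec(t) for t in tokens])
    spans = [(i, min(i + chunk_size, len(tokens)))
             for i in range(0, len(tokens), chunk_size)]
    chunks = [" ".join(tokens[s:e]) for s, e in spans]
    vecs = np.stack([tok_embs[s:e].mean(axis=0) for s, e in spans])
    return chunks, vecs

def search(query: str, chunks, vecs, k: int = 1):
    """Query time: embed the query (no document context) and rank by cosine similarity."""
    q = np.mean([token_vec(t) for t in query.split()], axis=0)
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [(chunks[i], float(sims[i])) for i in top]
```

Note that only indexing changes relative to a standard pipeline; `search` is the ordinary embed-and-compare step, which is why late chunking drops into existing vector databases without retrieval-side changes.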
Late Chunking vs Early Chunking — Contextual Embedding Quality
Early chunking: the document is split first (divided into chunks immediately), and each chunk is then embedded independently, in isolation. The resulting vectors (vA, vB, vC) capture only partial context — cross-chunk context is lost, and each embedding only knows its own chunk.
Late chunking: the entire document passes through the encoder first, then chunk boundaries are applied and the token embeddings are mean-pooled per span. The resulting vectors (vA, vB, vC) each inherit full-document context — richer representations.
Why it matters for retrieval: a chunk containing a pronoun like "it" has no context with early chunking. With late chunking, the encoder has seen the full document, so "it" is resolved to the correct entity — the embedding is semantically complete.
Real-World Example
A 99helpers API documentation page opens with: 'The Messages API allows you to create AI chatbot conversations.' Later in the document, a chunk reads: 'It accepts a model parameter specifying which LLM to use.' With standard chunking, 'It' is unresolved in the chunk embedding. With late chunking, the encoder processes the full document, so the token embedding for 'It' already encodes 'Messages API' through the encoder's self-attention. The resulting chunk vector better matches queries like 'Messages API model parameter,' improving retrieval for this chunk by 35% in internal benchmarks.
Common Mistakes
- ✕ Applying late chunking with a short-context embedding model — the model must be able to process the full document in one pass.
- ✕ Assuming late chunking eliminates the need for chunking strategy decisions — chunk boundaries still affect coherence and retrieval granularity.
- ✕ Ignoring the higher computational cost at indexing time compared to standard chunk-level embedding.
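The first mistake above is worth guarding against explicitly. A minimal sketch of such a guard, assuming an 8192-token context window purely for illustration (the real limit depends on your embedding model), with hypothetical function names:

```python
def can_late_chunk(doc_tokens: int, model_context: int = 8192) -> bool:
    """The whole document must fit in one encoder pass.
    8192 is an illustrative limit, not a universal one."""
    return doc_tokens <= model_context

def choose_strategy(doc_tokens: int, model_context: int = 8192) -> str:
    # Fall back to standard per-chunk embedding (or split the document into
    # overlapping macro-segments) when it exceeds the encoder's context window.
    if can_late_chunk(doc_tokens, model_context):
        return "late_chunking"
    return "standard_chunking"
```

Count tokens with the embedding model's own tokenizer, not by characters or words, since token counts differ across tokenizers.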
Related Terms
Document Chunking
Document chunking is the process of splitting large documents into smaller text segments before embedding and indexing for RAG, balancing chunk size to preserve context while staying within embedding model limits and enabling precise retrieval.
Embedding Model
An embedding model is a machine learning model that converts text (or other data) into dense numerical vectors that capture semantic meaning, enabling similarity search and serving as the foundation of RAG retrieval systems.
Semantic Chunking
Semantic chunking splits documents into segments based on meaning boundaries—grouping sentences that discuss the same topic together—rather than fixed character counts. This produces more coherent, self-contained chunks that improve retrieval quality.
Document Embedding
Document embedding is the process of converting text documents into numerical vector representations that capture their semantic meaning, enabling AI systems to find conceptually similar content through vector similarity search.
Context Window
A context window is the maximum amount of text (measured in tokens) that a language model can process in a single inference call, determining how much retrieved content, conversation history, and instructions can be included in a RAG prompt.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →