Retrieval-Augmented Generation (RAG)

Late Chunking

Definition

Late chunking, introduced by Jina AI, addresses a fundamental limitation of standard chunking: when documents are split before embedding, each chunk is embedded in isolation, losing the broader document context that neighboring sentences and paragraphs provide. In late chunking, the entire document (or a long passage) is first processed through a long-context embedding model that produces a contextualized token embedding for every token. The document is then split into chunks, but instead of re-embedding each chunk independently, the average of the token embeddings within each chunk's span is used as that chunk's vector. This means each chunk's embedding already incorporates global document context.
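The core operation is just span-wise mean pooling over contextualized token embeddings. A minimal NumPy sketch, using deterministic toy embeddings (the shapes and values are purely illustrative, not from any real model):

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, spans: list) -> np.ndarray:
    """Mean-pool contextualized token embeddings over each chunk's token span.

    token_embeddings: (num_tokens, dim) array from ONE full-document encoder
    pass, so every row already reflects global document context.
    spans: (start, end) token index pairs (end exclusive), one per chunk.
    """
    return np.stack([token_embeddings[s:e].mean(axis=0) for s, e in spans])

# Toy document: 10 tokens with 4-dim embeddings, split into two chunks.
tokens = np.arange(40, dtype=float).reshape(10, 4)
chunk_vectors = late_chunk(tokens, [(0, 6), (6, 10)])
print(chunk_vectors.shape)  # (2, 4): one vector per chunk
```

Because pooling happens after the full-document pass, the two chunk vectors differ from what embedding each chunk's text in isolation would produce.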

Why It Matters

Context is crucial for disambiguation in retrieval. A chunk containing the sentence 'It supports both REST and GraphQL APIs' is almost meaningless without knowing what 'it' refers to. Standard chunk-level embedding must infer this from the chunk alone, often failing for pronoun-heavy or highly contextual documents. Late chunking leverages the full-document context pass to resolve such ambiguities, producing richer chunk embeddings. For 99helpers help documents where product names and feature references are established in introductory paragraphs, late chunking can significantly improve retrieval of mid-document chunks that would otherwise lack context.

How It Works

Late chunking requires a long-context embedding model capable of processing entire documents (e.g., models supporting 8k+ tokens). During indexing, pass the full document through the encoder to obtain per-token embeddings. Apply your chunking strategy (fixed-size, recursive, or semantic) to determine chunk boundaries. For each chunk, aggregate (mean pool) the token embeddings within its boundary span to produce the chunk's vector representation. Store these contextually enriched vectors in the vector database. At query time, retrieval proceeds normally—the query is embedded and cosine similarity is computed against chunk vectors. The query itself is embedded without document context, so query-document alignment depends on the embedding model's ability to generalize.
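The fiddliest step in practice is mapping character-level chunk boundaries (which most chunkers produce) onto token index spans. A hedged sketch, assuming per-token character offsets of the kind HuggingFace fast tokenizers return with `return_offsets_mapping=True` (the toy offsets below are made up):

```python
def char_spans_to_token_spans(offsets, char_spans):
    """Map character-level chunk boundaries to token index spans.

    offsets: per-token (char_start, char_end) pairs, e.g. from a HuggingFace
    fast tokenizer called with return_offsets_mapping=True.
    char_spans: (start, end) character ranges produced by any chunker.
    Returns (tok_start, tok_end) pairs with tok_end exclusive, ready to be
    used as mean-pooling spans over the token embeddings.
    """
    token_spans = []
    for c_start, c_end in char_spans:
        # A token belongs to the chunk if its character range overlaps it.
        hits = [i for i, (s, e) in enumerate(offsets) if e > c_start and s < c_end]
        token_spans.append((hits[0], hits[-1] + 1))
    return token_spans

# Toy document of three words, one token each, with their char offsets.
offsets = [(0, 3), (4, 9), (10, 14)]
print(char_spans_to_token_spans(offsets, [(0, 9), (10, 14)]))  # [(0, 2), (2, 3)]
```

Any chunking strategy can be plugged in upstream; only the boundary-to-span mapping and the pooling change relative to standard indexing.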

Late Chunking vs Early Chunking — Contextual Embedding Quality

Early chunking: split first (the document is divided into chunks immediately), then embed each chunk independently, in isolation. The resulting vectors (vA, vB, vC) carry only partial context: cross-chunk context is lost, and each embedding knows only its own chunk.

Late chunking: embed the full document first (the entire document passes through the encoder), then mean-pool per span (chunk boundaries are applied afterward). The resulting vectors (vA, vB, vC) carry full-document context, yielding richer representations.

Context available during embedding: with early chunking, each chunk's embedding sees only that chunk's own text; with late chunking, every chunk's embedding is computed from a pass over the full document.

Why it matters for retrieval

A chunk containing a pronoun like "it" has no context with early chunking. With late chunking, the encoder has seen the full document, so "it" is resolved to the correct entity and the embedding is semantically complete.

Real-World Example

A 99helpers API documentation page opens with: 'The Messages API allows you to create AI chatbot conversations.' Later in the document, a chunk reads: 'It accepts a model parameter specifying which LLM to use.' With standard chunking, 'It' is unresolved in the chunk embedding. With late chunking, the encoder processes the full document, so the token embedding for 'It' already encodes 'Messages API' via the encoder's self-attention. The resulting chunk vector better matches queries like 'Messages API model parameter,' improving retrieval for this chunk by 35% in internal benchmarks.
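Query-time scoring is unchanged from standard dense retrieval. A minimal sketch of cosine-similarity ranking over stored chunk vectors (the vectors below are illustrative stand-ins, not real embeddings):

```python
import numpy as np

def cosine_top_k(query_vec, chunk_vecs, k=3):
    """Rank stored chunk vectors against a query embedding by cosine similarity.

    query_vec: (dim,) query embedding; chunk_vecs: (num_chunks, dim) matrix of
    late-chunked vectors. Returns [(chunk_index, score), ...] best-first.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(scores)[::-1][:k]
    return list(zip(order.tolist(), scores[order].tolist()))

# Toy index of three chunk vectors; the query aligns with chunk 1.
index = np.eye(3)
query = np.array([0.0, 1.0, 0.0])
print(cosine_top_k(query, index, k=1))  # chunk 1 ranks first
```

In production this scoring typically happens inside the vector database rather than in application code; the math is the same.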

Common Mistakes

  • Applying late chunking with a short-context embedding model—the model must be able to process the full document in one pass.
  • Assuming late chunking eliminates the need for chunking strategy decisions—chunk boundaries still affect coherence and retrieval granularity.
  • Ignoring the higher computational cost at indexing time compared to standard chunk-level embedding.
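For documents that exceed the model's context window, one fallback described alongside late chunking (sometimes called "long late chunking") is to encode overlapping macro windows and late-chunk within each. A sketch with illustrative window and overlap sizes (check your model's actual limit):

```python
def macro_spans(num_tokens, window=8192, overlap=512):
    """Split an overlong document into overlapping encoder windows.

    window and overlap are illustrative defaults, not recommendations; the
    overlap gives tokens near window edges some surrounding context even
    though no single pass sees the whole document.
    """
    if num_tokens <= window:
        return [(0, num_tokens)]  # fits in one pass: plain late chunking
    spans, start = [], 0
    while start + window < num_tokens:
        spans.append((start, start + window))
        start += window - overlap
    spans.append((start, num_tokens))  # final, possibly shorter window
    return spans

print(macro_spans(100, window=40, overlap=10))  # [(0, 40), (30, 70), (60, 100)]
```

Each window is then encoded and pooled separately; chunks falling inside a window inherit that window's context rather than the full document's, which is a weaker but still useful guarantee.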

