Reranking
Definition
Reranking is a two-stage retrieval pattern where a fast first-stage retrieval (dense ANN search, BM25, or hybrid) returns a larger candidate set (top-50 or top-100), and a slower but more accurate second-stage model reorders this set to produce the final top-k results. The reranker uses a cross-encoder architecture that processes the query and each candidate document together — allowing it to model fine-grained query-document interactions that bi-encoder models cannot capture. Reranking dramatically improves retrieval precision at the cost of additional latency, making it suitable when quality is critical and a few hundred extra milliseconds are acceptable.
Why It Matters
Reranking addresses a fundamental limitation of first-stage retrieval: bi-encoder and BM25 models encode queries and documents independently, missing subtle relevance signals that only emerge when they are processed together. A cross-encoder reranker can determine that while two documents are equally similar to the query on the surface, one actually answers the specific question asked and the other only tangentially mentions the topic. This fine-grained relevance assessment improves the quality of context passed to the LLM, which directly improves final answer accuracy in RAG systems.
How It Works
Reranking is implemented by passing the query and each candidate document to a cross-encoder model that outputs a relevance score for each pair. The cross-encoder processes the concatenated [query, document] input through transformer attention layers, enabling each token to attend to every other token — capturing precise relevance signals that bi-encoders miss. Popular reranking models include Cohere Rerank, Jina Reranker, and open-source models like cross-encoder/ms-marco-MiniLM. After scoring all candidates, documents are sorted by reranker score and the top-k are passed to the LLM. The two-stage design (fast retrieval + reranking) balances speed and quality.
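The two-stage flow described above can be sketched in Python. Real deployments would plug in an ANN index for the first stage and a cross-encoder (e.g. a sentence-transformers CrossEncoder or a hosted API such as Cohere Rerank) for the scorer; here, `overlap_search` and `jaccard_score` are toy stand-ins so the sketch runs end to end:

```python
from typing import Callable

def two_stage_retrieve(
    query: str,
    corpus: list[str],
    first_stage: Callable[[str, list[str], int], list[str]],
    rerank_score: Callable[[str, str], float],
    n_candidates: int = 50,
    k: int = 5,
) -> list[str]:
    """Fast, coarse first stage narrows the corpus; the slower reranker
    rescores each (query, candidate) pair and keeps the best k."""
    candidates = first_stage(query, corpus, n_candidates)   # ANN / BM25 / hybrid
    scored = [(rerank_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)     # best score first
    return [doc for _, doc in scored[:k]]

# Toy stand-ins: word-overlap "retrieval" and Jaccard-similarity
# "reranking" in place of real models.
def overlap_search(query: str, corpus: list[str], n: int) -> list[str]:
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))[:n]

def jaccard_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

corpus = [
    "reranking improves retrieval precision",
    "cats sleep most of the day",
    "precision matters in retrieval systems",
]
top = two_stage_retrieve("reranking precision", corpus, overlap_search,
                         jaccard_score, n_candidates=3, k=2)
```

The key design point is the interface: the first stage only needs to return a candidate list, and the reranker only needs to score one (query, document) pair at a time, so either stage can be swapped independently.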
Reranking — Two-Stage Retrieval for Precision
[Diagram] Stage 1: vector ANN search returns the top-20 candidates (latency ~8ms). Stage 2: the cross-encoder scores all 20 query-document pairs and reorders them by relevance (latency +45ms), yielding the precision@5 improvement.
Real-World Example
A 99helpers customer adds a cross-encoder reranker to their RAG pipeline. First-stage hybrid retrieval returns top-50 candidates, and the reranker selects the final top-5 to pass as context to the LLM. On their evaluation set, retrieval precision@5 (fraction of the top-5 that are truly relevant) improves from 68% to 87% after adding reranking. The LLM receives better context and generates correct answers on 81% of queries versus 67% before reranking — a 14-point accuracy gain.
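The precision@5 metric from this example is straightforward to compute. A minimal sketch, with illustrative document IDs and relevance labels (not the customer's actual data):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are truly relevant."""
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / k

# Hypothetical top-5 lists for one query, before and after reranking.
relevant = {"d1", "d4", "d6", "d9"}
before = ["d1", "d2", "d3", "d4", "d5"]   # 2 of 5 relevant
after  = ["d1", "d4", "d6", "d9", "d5"]   # 4 of 5 relevant

p_before = precision_at_k(before, relevant)   # 0.4
p_after = precision_at_k(after, relevant)     # 0.8
```

Averaging this per-query value over an evaluation set gives the aggregate precision@5 figures reported above.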
Common Mistakes
- ✕ Applying reranking to the entire document corpus rather than a pre-retrieved candidate set — cross-encoders are too slow for corpus-scale scoring; always use first-stage retrieval to get candidates first
- ✕ Using the same model for both retrieval and reranking — the value of the two-stage system depends on the reranker providing complementary signal to the first stage
- ✕ Not measuring whether reranking improves your specific use case — reranking adds latency; only add it if measured improvement justifies the cost
Related Terms
Dense Retrieval
Dense retrieval is a retrieval approach that encodes both queries and documents into dense embedding vectors and finds relevant documents by computing vector similarity, enabling semantic matching beyond exact keyword overlap.
Hybrid Retrieval
Hybrid retrieval combines dense (semantic) and sparse (keyword) search methods to leverage the strengths of both, using a fusion step to merge their results into a single ranked list for better overall retrieval quality.
Cross-Encoder
A cross-encoder is a transformer model that processes a query and a document together in a single forward pass, producing a relevance score that captures fine-grained query-document interactions for high-quality reranking.
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language model responses by first retrieving relevant documents from an external knowledge base and then using that retrieved content as context when generating an answer.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.