Reranking
Definition
Reranking is a two-stage retrieval pattern where a fast first-stage retrieval (dense ANN search, BM25, or hybrid) returns a larger candidate set (top-50 or top-100), and a slower but more accurate second-stage model reorders this set to produce the final top-k results. The reranker uses a cross-encoder architecture that processes the query and each candidate document together — allowing it to model fine-grained query-document interactions that bi-encoder models cannot capture. Reranking dramatically improves retrieval precision at the cost of additional latency, making it suitable when quality is critical and a few hundred extra milliseconds are acceptable.
Why It Matters
Reranking addresses a fundamental limitation of first-stage retrieval: bi-encoder and BM25 models encode queries and documents independently, missing subtle relevance signals that only emerge when they are processed together. A cross-encoder reranker can determine that while two documents are equally similar to the query on the surface, one actually answers the specific question asked and the other only tangentially mentions the topic. This fine-grained relevance assessment improves the quality of context passed to the LLM, which directly improves final answer accuracy in RAG systems.
How It Works
Reranking is implemented by passing the query and each candidate document to a cross-encoder model that outputs a relevance score for each pair. The cross-encoder processes the concatenated [query, document] input through transformer attention layers, enabling each token to attend to every other token — capturing precise relevance signals that bi-encoders miss. Popular reranking models include Cohere Rerank, Jina Reranker, and open-source models like cross-encoder/ms-marco-MiniLM. After scoring all candidates, documents are sorted by reranker score and the top-k are passed to the LLM. The two-stage design (fast retrieval + reranking) balances speed and quality.
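The two-stage flow described above can be sketched in Python. Real deployments would plug in an ANN index for the first stage and a cross-encoder (e.g. a sentence-transformers CrossEncoder or a hosted API such as Cohere Rerank) for the scorer; here, `overlap_search` and `jaccard_score` are toy stand-ins so the sketch runs end to end:

```python
from typing import Callable

def two_stage_retrieve(
    query: str,
    corpus: list[str],
    first_stage: Callable[[str, list[str], int], list[str]],
    rerank_score: Callable[[str, str], float],
    n_candidates: int = 50,
    k: int = 5,
) -> list[str]:
    """Fast, coarse first stage narrows the corpus; the slower reranker
    rescores each (query, candidate) pair and keeps the best k."""
    candidates = first_stage(query, corpus, n_candidates)   # ANN / BM25 / hybrid
    scored = [(rerank_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)     # best score first
    return [doc for _, doc in scored[:k]]

# Toy stand-ins: word-overlap "retrieval" and Jaccard-similarity
# "reranking" in place of real models.
def overlap_search(query: str, corpus: list[str], n: int) -> list[str]:
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))[:n]

def jaccard_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

corpus = [
    "reranking improves retrieval precision",
    "cats sleep most of the day",
    "precision matters in retrieval systems",
]
top = two_stage_retrieve("reranking precision", corpus, overlap_search,
                         jaccard_score, n_candidates=3, k=2)
```

The key design point is the interface: the first stage only needs to return a candidate list, and the reranker only needs to score one (query, document) pair at a time, so either stage can be swapped independently.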
Reranking — Two-Stage Retrieval for Precision
[Diagram] Stage 1: vector ANN search returns the top-20 candidates (latency ~8ms). Stage 2: the cross-encoder scores all 20 query-document pairs and reorders them by relevance (latency +45ms), yielding the precision@5 improvement.
Real-World Example
A 99helpers customer adds a cross-encoder reranker to their RAG pipeline. First-stage hybrid retrieval returns top-50 candidates, and the reranker selects the final top-5 to pass as context to the LLM. On their evaluation set, retrieval precision@5 (fraction of the top-5 that are truly relevant) improves from 68% to 87% after adding reranking. The LLM receives better context and generates correct answers on 81% of queries versus 67% before reranking — a 14-point accuracy gain.
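The precision@5 metric from this example is straightforward to compute. A minimal sketch, with illustrative document IDs and relevance labels (not the customer's actual data):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are truly relevant."""
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / k

# Hypothetical top-5 lists for one query, before and after reranking.
relevant = {"d1", "d4", "d6", "d9"}
before = ["d1", "d2", "d3", "d4", "d5"]   # 2 of 5 relevant
after  = ["d1", "d4", "d6", "d9", "d5"]   # 4 of 5 relevant

p_before = precision_at_k(before, relevant)   # 0.4
p_after = precision_at_k(after, relevant)     # 0.8
```

Averaging this per-query value over an evaluation set gives the aggregate precision@5 figures reported above.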
Common Mistakes
- ✕ Applying reranking to the entire document corpus rather than a pre-retrieved candidate set — cross-encoders are too slow for corpus-scale scoring; always use first-stage retrieval to get candidates first
- ✕ Using the same model for both retrieval and reranking — the value of the two-stage system depends on the reranker providing complementary signal to the first stage
- ✕ Not measuring whether reranking improves your specific use case — reranking adds latency; only add it if measured improvement justifies the cost
Related Terms
Dense Retrieval
Dense retrieval is a retrieval approach that encodes both queries and documents into dense embedding vectors and finds relevant documents by computing vector similarity, enabling semantic matching beyond exact keyword overlap.
Hybrid Retrieval
Hybrid retrieval combines dense (semantic) and sparse (keyword) search methods to leverage the strengths of both, using a fusion step to merge their results into a single ranked list for better overall retrieval quality.
Cross-Encoder
A cross-encoder is a transformer model that processes a query and a document together in a single forward pass, producing a relevance score that captures fine-grained query-document interactions for high-quality reranking.
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language model responses by first retrieving relevant documents from an external knowledge base and then using that retrieved content as context when generating an answer.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.