Cross-Encoder
Definition
A cross-encoder is a neural network architecture used for relevance scoring that takes a query-document pair as joint input — concatenating them with a separator token — and passes the combined input through transformer attention layers. Because all tokens in the query and document can attend to each other, the model can capture nuanced relevance signals: whether the document directly answers the specific question asked, whether it contains supporting evidence, or whether it is on-topic but not specifically helpful. Cross-encoders produce more accurate relevance scores than bi-encoders (which encode query and document independently) but require a separate inference for each query-document pair, making them computationally expensive for large-scale retrieval.
Why It Matters
Cross-encoders represent the quality ceiling of retrieval models — they consistently produce the most accurate relevance scores when query and document are processed jointly. The architectural difference from bi-encoders is fundamental: joint processing enables the model to detect that a document about 'password reset' is the answer to 'how do I change my password?' even though the exact phrase does not appear. This makes cross-encoders ideal for the reranking stage of RAG pipelines where only 20-100 candidates need scoring. The quality improvement from cross-encoder reranking is often larger than improvements from better first-stage retrieval.
How It Works
Cross-encoders are implemented as sequence-pair classifiers: the query and document are concatenated (format: [CLS] query [SEP] document [SEP]), passed through a pre-trained transformer (BERT, RoBERTa, DeBERTa), and the [CLS] token representation is passed to a linear layer that outputs a relevance score. Cross-encoders are fine-tuned on datasets of query-document pairs labeled with relevance judgments (MS MARCO, NQ, other BEIR benchmarks). For inference in reranking pipelines, the cross-encoder scores each candidate document in parallel batches, and documents are sorted by score. Latency is typically 50-200ms for reranking 50 candidates.
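The scoring flow above can be sketched in a few lines of Python. Note that `score_pair` here is a deliberately crude stand-in (token overlap) for a real cross-encoder forward pass, so the sketch runs without a model; `build_input` and `rerank` are illustrative names, not a specific library's API.

```python
# Sketch of the cross-encoder reranking flow. `score_pair` is a toy
# stand-in for a fine-tuned transformer sequence-pair classifier.

def build_input(query: str, doc: str) -> str:
    # Sequence-pair format fed to the transformer.
    return f"[CLS] {query} [SEP] {doc} [SEP]"

def score_pair(query: str, doc: str) -> float:
    # Toy relevance score: fraction of query tokens found in the doc.
    # A real cross-encoder would run the concatenated input through
    # attention layers and read a score off the [CLS] representation.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[tuple[str, float]]:
    # One inference per query-document pair, then sort by score.
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

query = "how do i export my data"
docs = [
    "Our platform supports multiple file formats for import operations.",
    "You can export data as CSV or JSON from the Settings page.",
    "Billing invoices can be downloaded from the account portal.",
]
ranked = rerank(query, docs)
```

In production the stub would be replaced by batched forward passes through the fine-tuned model; the surrounding pipeline (pair construction, score, sort) stays the same.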
Cross-Encoder Reranking (example)
Query: "How do I export my data?"
Each candidate is scored as a separate input pair ([Query] + [Doc]) through the cross-encoder:
- [Doc 2] "You can export data as CSV or JSON from the Settings page." → score 0.94 (rank #1)
- [Doc 1] "Our platform supports multiple file formats for import operations." → score 0.71 (rank #2)
- [Doc 3] "Billing invoices can be downloaded from the account portal." → score 0.42 (rank #3)
Reranked order: Doc 2, Doc 1, Doc 3
Cross-Encoder
- Processes query + doc jointly
- High accuracy — best for reranking
- Slow: O(N) — cannot pre-compute
Bi-Encoder
- Encodes query and doc separately
- Fast: pre-compute doc embeddings
- Lower accuracy — used for recall
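The trade-off in the comparison above can be made concrete with a toy sketch: `embed` stands in for a bi-encoder (document vectors computed once, offline), while `joint_score` stands in for a cross-encoder pass that must run once per query-document pair and cannot be precomputed. The bag-of-words vectors and all function names are illustrative assumptions, not real model APIs.

```python
# Toy contrast between bi-encoder and cross-encoder scoring paths.

def embed(text: str) -> dict[str, int]:
    # Bag-of-words "embedding" so the sketch runs without a model.
    vec: dict[str, int] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def dot(a: dict[str, int], b: dict[str, int]) -> int:
    return sum(v * b.get(k, 0) for k, v in a.items())

docs = ["export data as csv", "import file formats", "billing invoices"]

# Bi-encoder path: document vectors are computed ONCE, offline.
doc_vecs = [embed(d) for d in docs]

def bi_encoder_search(query: str) -> list[int]:
    # At query time: one query embedding + a cheap similarity per doc.
    q = embed(query)
    return sorted(range(len(docs)), key=lambda i: dot(q, doc_vecs[i]), reverse=True)

def cross_encoder_rerank(query: str, candidate_ids: list[int]) -> list[int]:
    # One full joint pass per (query, doc) pair — nothing precomputable.
    def joint_score(q: str, d: str) -> int:
        return dot(embed(q), embed(d))  # stand-in for a transformer pass
    return sorted(candidate_ids, key=lambda i: joint_score(query, docs[i]), reverse=True)

candidates = bi_encoder_search("export my data")   # fast recall stage
final = cross_encoder_rerank("export my data", candidates[:2])  # slow, accurate stage
```

This is the standard two-stage pattern: the bi-encoder narrows millions of documents to a short candidate list, and the cross-encoder spends its per-pair inference budget only on that list.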
Real-World Example
A 99helpers customer tests two reranking approaches: bi-encoder reranking (fast, using the same model as first-stage retrieval to rerank by similarity score) and cross-encoder reranking (slower, using a dedicated cross-encoder). On 200 test queries, bi-encoder reranking improves precision@5 by 4 points over no reranking. Cross-encoder reranking improves precision@5 by 18 points. The additional 120ms latency from cross-encoder scoring is accepted given the significant quality improvement.
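A precision@5 comparison like the one this customer ran can be sketched as follows; the document IDs and relevance labels below are made up for illustration, not the customer's actual data.

```python
# Sketch of a precision@k evaluation. IDs and labels are illustrative.

def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    # Fraction of the top-k results that are labeled relevant.
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k

relevant = {"d2", "d7"}
before = ["d1", "d3", "d2", "d9", "d4"]   # first-stage retrieval order
after = ["d2", "d7", "d1", "d3", "d9"]    # after cross-encoder reranking

p_before = precision_at_k(before, relevant)  # 1 relevant doc in top 5
p_after = precision_at_k(after, relevant)    # 2 relevant docs in top 5
```

Averaging this metric over a test set (here, 200 queries) is how the two reranking approaches would be compared.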
Common Mistakes
- ✕ Using cross-encoders for first-stage retrieval over a large corpus — cross-encoders require one inference per document and cannot scale to millions of documents
- ✕ Not fine-tuning the cross-encoder on domain-specific relevance data — general-purpose cross-encoders underperform on specialized domains
- ✕ Reranking too many candidates — scoring 1,000 candidates with a cross-encoder adds seconds of latency; rerank 50-100 candidates for practical latency
Related Terms
Reranking
Reranking is a second-stage retrieval step that takes an initial set of candidate documents returned by a fast retrieval method and reorders them using a more accurate but computationally expensive model to improve final result quality.
Dense Retrieval
Dense retrieval is a retrieval approach that encodes both queries and documents into dense embedding vectors and finds relevant documents by computing vector similarity, enabling semantic matching beyond exact keyword overlap.
Bi-Encoder
A bi-encoder is a neural network architecture that independently encodes queries and documents into separate embedding vectors, enabling fast offline document indexing and real-time similarity search for scalable retrieval.
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language model responses by first retrieving relevant documents from an external knowledge base and then using that retrieved content as context when generating an answer.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →