Cross-Encoder
Definition
A cross-encoder is a neural network architecture used for relevance scoring that takes a query-document pair as joint input — concatenating them with a separator token — and passes the combined input through transformer attention layers. Because all tokens in the query and document can attend to each other, the model can capture nuanced relevance signals: whether the document directly answers the specific question asked, whether it contains supporting evidence, or whether it is on-topic but not specifically helpful. Cross-encoders produce more accurate relevance scores than bi-encoders (which encode query and document independently) but require a separate inference for each query-document pair, making them computationally expensive for large-scale retrieval.
Why It Matters
Cross-encoders represent the quality ceiling of retrieval models — they consistently produce the most accurate relevance scores when query and document are processed jointly. The architectural difference from bi-encoders is fundamental: joint processing enables the model to detect that a document about 'password reset' is the answer to 'how do I change my password?' even though the exact phrase does not appear. This makes cross-encoders ideal for the reranking stage of RAG pipelines where only 20-100 candidates need scoring. The quality improvement from cross-encoder reranking is often larger than improvements from better first-stage retrieval.
How It Works
Cross-encoders are implemented as sequence-pair classifiers: the query and document are concatenated (format: [CLS] query [SEP] document [SEP]), passed through a pre-trained transformer (BERT, RoBERTa, DeBERTa), and the [CLS] token representation is passed to a linear layer that outputs a relevance score. Cross-encoders are fine-tuned on datasets of query-document pairs labeled with relevance judgments (MS MARCO, NQ, other BEIR benchmarks). For inference in reranking pipelines, the cross-encoder scores each candidate document in parallel batches, and documents are sorted by score. Latency is typically 50-200ms for reranking 50 candidates.
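The scoring flow above can be sketched in a few lines of Python. Note that `score_pair` here is a deliberately crude stand-in (token overlap) for a real cross-encoder forward pass, so the sketch runs without a model; `build_input` and `rerank` are illustrative names, not a specific library's API.

```python
# Sketch of the cross-encoder reranking flow. `score_pair` is a toy
# stand-in for a fine-tuned transformer sequence-pair classifier.

def build_input(query: str, doc: str) -> str:
    # Sequence-pair format fed to the transformer.
    return f"[CLS] {query} [SEP] {doc} [SEP]"

def score_pair(query: str, doc: str) -> float:
    # Toy relevance score: fraction of query tokens found in the doc.
    # A real cross-encoder would run the concatenated input through
    # attention layers and read a score off the [CLS] representation.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[tuple[str, float]]:
    # One inference per query-document pair, then sort by score.
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

query = "how do i export my data"
docs = [
    "Our platform supports multiple file formats for import operations.",
    "You can export data as CSV or JSON from the Settings page.",
    "Billing invoices can be downloaded from the account portal.",
]
ranked = rerank(query, docs)
```

In production the stub would be replaced by batched forward passes through the fine-tuned model; the surrounding pipeline (pair construction, score, sort) stays the same.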
Cross-Encoder Reranking (example)
Query: "How do I export my data?"
Each candidate is scored as a separate input pair ([Query] + [Doc]) through the cross-encoder:
- [Doc 2] "You can export data as CSV or JSON from the Settings page." → score 0.94 (rank #1)
- [Doc 1] "Our platform supports multiple file formats for import operations." → score 0.71 (rank #2)
- [Doc 3] "Billing invoices can be downloaded from the account portal." → score 0.42 (rank #3)
Reranked order: Doc 2, Doc 1, Doc 3
Cross-Encoder
- Processes query + doc jointly
- High accuracy — best for reranking
- Slow: O(N) — cannot pre-compute
Bi-Encoder
- Encodes query and doc separately
- Fast: pre-compute doc embeddings
- Lower accuracy — used for recall
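The trade-off in the comparison above can be made concrete with a toy sketch: `embed` stands in for a bi-encoder (document vectors computed once, offline), while `joint_score` stands in for a cross-encoder pass that must run once per query-document pair and cannot be precomputed. The bag-of-words vectors and all function names are illustrative assumptions, not real model APIs.

```python
# Toy contrast between bi-encoder and cross-encoder scoring paths.

def embed(text: str) -> dict[str, int]:
    # Bag-of-words "embedding" so the sketch runs without a model.
    vec: dict[str, int] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def dot(a: dict[str, int], b: dict[str, int]) -> int:
    return sum(v * b.get(k, 0) for k, v in a.items())

docs = ["export data as csv", "import file formats", "billing invoices"]

# Bi-encoder path: document vectors are computed ONCE, offline.
doc_vecs = [embed(d) for d in docs]

def bi_encoder_search(query: str) -> list[int]:
    # At query time: one query embedding + a cheap similarity per doc.
    q = embed(query)
    return sorted(range(len(docs)), key=lambda i: dot(q, doc_vecs[i]), reverse=True)

def cross_encoder_rerank(query: str, candidate_ids: list[int]) -> list[int]:
    # One full joint pass per (query, doc) pair — nothing precomputable.
    def joint_score(q: str, d: str) -> int:
        return dot(embed(q), embed(d))  # stand-in for a transformer pass
    return sorted(candidate_ids, key=lambda i: joint_score(query, docs[i]), reverse=True)

candidates = bi_encoder_search("export my data")   # fast recall stage
final = cross_encoder_rerank("export my data", candidates[:2])  # slow, accurate stage
```

This is the standard two-stage pattern: the bi-encoder narrows millions of documents to a short candidate list, and the cross-encoder spends its per-pair inference budget only on that list.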
Real-World Example
A 99helpers customer tests two reranking approaches: bi-encoder reranking (fast, using the same model as first-stage retrieval to rerank by similarity score) and cross-encoder reranking (slower, using a dedicated cross-encoder). On 200 test queries, bi-encoder reranking improves precision@5 by 4 points over no reranking. Cross-encoder reranking improves precision@5 by 18 points. The additional 120ms latency from cross-encoder scoring is accepted given the significant quality improvement.
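A precision@5 comparison like the one this customer ran can be sketched as follows; the document IDs and relevance labels below are made up for illustration, not the customer's actual data.

```python
# Sketch of a precision@k evaluation. IDs and labels are illustrative.

def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    # Fraction of the top-k results that are labeled relevant.
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k

relevant = {"d2", "d7"}
before = ["d1", "d3", "d2", "d9", "d4"]   # first-stage retrieval order
after = ["d2", "d7", "d1", "d3", "d9"]    # after cross-encoder reranking

p_before = precision_at_k(before, relevant)  # 1 relevant doc in top 5
p_after = precision_at_k(after, relevant)    # 2 relevant docs in top 5
```

Averaging this metric over a test set (here, 200 queries) is how the two reranking approaches would be compared.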
Common Mistakes
- ✕ Using cross-encoders for first-stage retrieval over a large corpus — cross-encoders require one inference per document and cannot scale to millions of documents
- ✕ Not fine-tuning the cross-encoder on domain-specific relevance data — general-purpose cross-encoders underperform on specialized domains
- ✕ Reranking too many candidates — scoring 1,000 candidates with a cross-encoder adds seconds of latency; rerank 50-100 candidates for practical latency
Related Terms
Reranking
Reranking is a second-stage retrieval step that takes an initial set of candidate documents returned by a fast retrieval method and reorders them using a more accurate but computationally expensive model to improve final result quality.
Dense Retrieval
Dense retrieval is a retrieval approach that encodes both queries and documents into dense embedding vectors and finds relevant documents by computing vector similarity, enabling semantic matching beyond exact keyword overlap.
Bi-Encoder
A bi-encoder is a neural network architecture that independently encodes queries and documents into separate embedding vectors, enabling fast offline document indexing and real-time similarity search for scalable retrieval.
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language model responses by first retrieving relevant documents from an external knowledge base and then using that retrieved content as context when generating an answer.
Retrieval Precision
Retrieval precision measures the fraction of retrieved documents that are actually relevant to the query. In RAG systems, high precision means the context passed to the LLM contains mostly useful information rather than noise.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →