Bi-Encoder
Definition
A bi-encoder (also called a dual encoder or two-tower model) is a neural architecture that uses two separate encoder networks (often with shared weights) to independently encode a query and a document into embedding vectors, then computes similarity by comparing the vectors. The key property is independence: the query and the document are encoded separately, with no cross-attention between them. This independence enables offline pre-computation of document embeddings: documents are encoded once and stored, and only the query embedding is computed in real time. The trade-off is that bi-encoders cannot model fine-grained query-document interactions as accurately as cross-encoders can, but they scale to billions of documents.
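As a minimal illustration of the independence property, the sketch below substitutes a toy hash-based encoder for a real transformer. The `toy_encode` function, its dimensionality, and the example texts are illustrative assumptions, not a real model:

```python
import numpy as np

def toy_encode(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a neural encoder: a deterministic pseudo-embedding
    built from per-token hashes. A real bi-encoder runs a transformer."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        rng = np.random.default_rng(abs(hash(token)) % (2**32))
        vec += rng.standard_normal(dim)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Query and document are encoded independently -- no cross-attention.
q = toy_encode("reset password")
d = toy_encode("how to reset your password")
score = float(q @ d)  # cosine similarity (both vectors are unit-normalized)
```

Because the two encodings never see each other, the document side can run long before any query arrives, which is the whole point of the architecture.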
Why It Matters
Bi-encoders are the practical architecture that makes large-scale semantic search possible. Because document embeddings can be pre-computed and cached, retrieval over millions of documents requires only one inference operation at query time (the query embedding), followed by ANN search over the stored document vectors. This architecture is why vector databases can answer queries in milliseconds. Understanding the bi-encoder vs. cross-encoder trade-off is essential for RAG system design: bi-encoders for scalable first-stage retrieval, cross-encoders for accurate second-stage reranking.
How It Works
Bi-encoders are trained with contrastive learning objectives: given a query, the model learns to produce embeddings in which the correct document's embedding is close to the query embedding while incorrect documents' embeddings are far away. Training data consists of query-positive-document pairs (from MS MARCO, Natural Questions, or custom datasets). At indexing time, document embeddings are computed by passing each document through the encoder and storing the resulting vector; query embeddings are computed in real time when a query arrives. The cosine similarity between the query and document vectors provides the retrieval score. Popular bi-encoder implementations include the sentence-transformers library and commercial embedding APIs.
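The contrastive objective described above can be sketched as an InfoNCE-style loss with in-batch negatives, one common variant used to train bi-encoders (e.g., the multiple-negatives ranking setup). Raw NumPy arrays stand in for encoder outputs here; the function name and temperature value are illustrative:

```python
import numpy as np

def info_nce_loss(q_emb: np.ndarray, d_emb: np.ndarray,
                  temperature: float = 0.05) -> float:
    """In-batch contrastive loss: row i of q_emb pairs with row i of
    d_emb (the positive); every other row in the batch is a negative."""
    # Normalize so dot products are cosine similarities.
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature  # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal as the target class for each row.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss pushes the diagonal (query-positive similarities) up relative to the off-diagonal negatives, which is exactly the "correct document close, incorrect documents far" behavior described above.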
Bi-Encoder Architecture
Bi-encoder (independent): the query (e.g., "reset password") passes through Encoder A to produce a query vector such as [0.23, -0.81, 0.45...], while the document (e.g., "How to reset your password...") passes through Encoder B, which may share weights with Encoder A and is pre-computable offline, to produce a document vector such as [0.21, -0.79, 0.48...]. Cosine similarity between the two vectors gives the retrieval score (here, 0.97).
Key advantage: offline indexing
- Index time: embed all documents once → store vectors
- Query time: embed the query only → search the index
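A rough sketch of these two phases, with random vectors standing in for real embeddings and exhaustive dot-product search standing in for a production ANN index (e.g., HNSW):

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Index time: embed all documents once and store the matrix. ---
# Random unit vectors stand in for real document embeddings.
doc_vectors = rng.standard_normal((10_000, 128))
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

# --- Query time: embed only the query, then search the stored index. ---
# Simulate a query whose embedding lands near document 7.
query_vector = doc_vectors[7] + 0.1 * rng.standard_normal(128)
query_vector /= np.linalg.norm(query_vector)

scores = doc_vectors @ query_vector  # cosine similarity to every document
top_k = np.argsort(-scores)[:5]      # indices of the 5 best matches
```

Only the query embedding is computed per request; the 10,000 document embeddings were produced once at index time and reused for every query.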
Bi-Encoder
- Encodes query and document separately
- Fast: one encoder pass per query; similarity with pre-computed document vectors is a cheap dot product
- Good for large-scale first-stage retrieval
Cross-Encoder
- Processes query and document together in one forward pass
- Slow: O(N) encoder passes per query over N candidate documents
- Higher accuracy; used for reranking
Real-World Example
A 99helpers customer building a custom RAG system chooses a two-stage bi-encoder/cross-encoder architecture. They use a general-purpose bi-encoder (OpenAI text-embedding-3-small) for fast first-stage retrieval over 25,000 chunks, returning the top-20 candidates in 15ms. A lightweight cross-encoder reranker then scores those 20 candidates in 80ms, for a total retrieval latency of 95ms. By splitting the work between a fast bi-encoder and an accurate cross-encoder, they achieve both production-scale speed and high retrieval precision.
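The two-stage split can be sketched as below. The `rerank` scorer is a placeholder stand-in for a real cross-encoder, and all sizes and names are illustrative assumptions; the point is the shape of the pipeline, not the scoring math:

```python
import numpy as np

rng = np.random.default_rng(0)
N_DOCS, DIM = 25_000, 64

# Stage 1 (bi-encoder): cheap similarity over all pre-computed embeddings.
doc_embs = rng.standard_normal((N_DOCS, DIM))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

def first_stage(query_emb: np.ndarray, k: int = 20) -> np.ndarray:
    """Return indices of the top-k candidates by dot-product similarity."""
    return np.argsort(-(doc_embs @ query_emb))[:k]

def rerank(query_emb: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Stand-in for a cross-encoder: an expensive per-pair score applied
    only to the k candidates, never to all N_DOCS documents."""
    pair_scores = doc_embs[candidates] @ query_emb  # placeholder scoring
    return candidates[np.argsort(-pair_scores)]

query = doc_embs[123]            # pretend the query matches document 123
candidates = first_stage(query)  # 20 candidates out of 25,000 documents
final = rerank(query, candidates)  # reordered by the expensive scorer
```

The expensive scorer touches 20 documents instead of 25,000, which is why the two-stage design keeps total latency low while recovering most of the cross-encoder's precision on the final ranking.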
Common Mistakes
- ✕Confusing bi-encoders with cross-encoders — bi-encoders produce independent vectors enabling fast search; cross-encoders process pairs enabling accurate scoring; they serve different roles in the retrieval pipeline
- ✕Fine-tuning only the cross-encoder reranker without fine-tuning the bi-encoder — if the bi-encoder misses relevant documents in first-stage retrieval, the reranker cannot recover them
- ✕Using bi-encoders for final relevance scoring — bi-encoders are for scalable retrieval, not final precision scoring; use cross-encoders for the final relevance judgment
Related Terms
Cross-Encoder
A cross-encoder is a transformer model that processes a query and a document together in a single forward pass, producing a relevance score that captures fine-grained query-document interactions for high-quality reranking.
Embedding Model
An embedding model is a machine learning model that converts text (or other data) into dense numerical vectors that capture semantic meaning, enabling similarity search and serving as the foundation of RAG retrieval systems.
Dense Retrieval
Dense retrieval is a retrieval approach that encodes both queries and documents into dense embedding vectors and finds relevant documents by computing vector similarity, enabling semantic matching beyond exact keyword overlap.
Reranking
Reranking is a second-stage retrieval step that takes an initial set of candidate documents returned by a fast retrieval method and reorders them using a more accurate but computationally expensive model to improve final result quality.
Approximate Nearest Neighbor
Approximate Nearest Neighbor (ANN) search finds vectors that are close to a query vector with high probability but without guaranteeing exactness, enabling fast similarity search across millions of vectors at the cost of small accuracy tradeoffs.