Bi-Encoder
Definition
A bi-encoder (also called a dual encoder or two-tower model) is a neural architecture that uses two separate encoder networks (often with shared weights) to independently encode a query and a document into embedding vectors, then computes similarity by comparing the vectors. The key property is independence: the query and the document are encoded separately, with no cross-attention between them. This independence enables offline pre-computation of document embeddings: documents are encoded once and stored, and only the query embedding is computed in real time. The trade-off is that bi-encoders cannot model fine-grained query-document interactions as accurately as cross-encoders can, but they scale to billions of documents.
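As a minimal illustration of the independence property, the sketch below substitutes a toy hash-based encoder for a real transformer. The `toy_encode` function, its dimensionality, and the example texts are illustrative assumptions, not a real model:

```python
import numpy as np

def toy_encode(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a neural encoder: a deterministic pseudo-embedding
    built from per-token hashes. A real bi-encoder runs a transformer."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        rng = np.random.default_rng(abs(hash(token)) % (2**32))
        vec += rng.standard_normal(dim)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Query and document are encoded independently -- no cross-attention.
q = toy_encode("reset password")
d = toy_encode("how to reset your password")
score = float(q @ d)  # cosine similarity (both vectors are unit-normalized)
```

Because the two encodings never see each other, the document side can run long before any query arrives, which is the whole point of the architecture.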
Why It Matters
Bi-encoders are the practical architecture that makes large-scale semantic search possible. Because document embeddings can be pre-computed and cached, retrieval over millions of documents requires only one inference operation at query time (the query embedding), followed by ANN search over the stored document vectors. This architecture is why vector databases can answer queries in milliseconds. Understanding the bi-encoder vs. cross-encoder trade-off is essential for RAG system design: bi-encoders for scalable first-stage retrieval, cross-encoders for accurate second-stage reranking.
How It Works
Bi-encoders are trained with contrastive learning objectives: given a query, the model learns to produce embeddings in which the correct document's embedding is close to the query embedding while incorrect documents' embeddings are far away. Training data consists of query-positive-document pairs (from MS MARCO, Natural Questions, or custom datasets). At indexing time, document embeddings are computed by passing each document through the encoder and storing the resulting vector; query embeddings are computed in real time when a query arrives. The cosine similarity between the query and document vectors provides the retrieval score. Popular bi-encoder implementations include the sentence-transformers library and commercial embedding APIs.
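The contrastive objective described above can be sketched as an InfoNCE-style loss with in-batch negatives, one common variant used to train bi-encoders (e.g., the multiple-negatives ranking setup). Raw NumPy arrays stand in for encoder outputs here; the function name and temperature value are illustrative:

```python
import numpy as np

def info_nce_loss(q_emb: np.ndarray, d_emb: np.ndarray,
                  temperature: float = 0.05) -> float:
    """In-batch contrastive loss: row i of q_emb pairs with row i of
    d_emb (the positive); every other row in the batch is a negative."""
    # Normalize so dot products are cosine similarities.
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature  # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal as the target class for each row.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss pushes the diagonal (query-positive similarities) up relative to the off-diagonal negatives, which is exactly the "correct document close, incorrect documents far" behavior described above.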
Bi-Encoder Architecture
Bi-encoder (independent): the query (e.g., "reset password") passes through Encoder A to produce a query vector such as [0.23, -0.81, 0.45...], while the document (e.g., "How to reset your password...") passes through Encoder B, which may share weights with Encoder A and is pre-computable offline, to produce a document vector such as [0.21, -0.79, 0.48...]. Cosine similarity between the two vectors gives the retrieval score (here, 0.97).
Key advantage: offline indexing
- Index time: embed all documents once → store vectors
- Query time: embed the query only → search the index
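A rough sketch of these two phases, with random vectors standing in for real embeddings and exhaustive dot-product search standing in for a production ANN index (e.g., HNSW):

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Index time: embed all documents once and store the matrix. ---
# Random unit vectors stand in for real document embeddings.
doc_vectors = rng.standard_normal((10_000, 128))
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

# --- Query time: embed only the query, then search the stored index. ---
# Simulate a query whose embedding lands near document 7.
query_vector = doc_vectors[7] + 0.1 * rng.standard_normal(128)
query_vector /= np.linalg.norm(query_vector)

scores = doc_vectors @ query_vector  # cosine similarity to every document
top_k = np.argsort(-scores)[:5]      # indices of the 5 best matches
```

Only the query embedding is computed per request; the 10,000 document embeddings were produced once at index time and reused for every query.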
Bi-Encoder
- Encodes query and document separately
- Fast: one encoder pass per query; similarity with pre-computed document vectors is a cheap dot product
- Good for large-scale first-stage retrieval
Cross-Encoder
- Processes query and document together in one forward pass
- Slow: O(N) encoder passes per query over N candidate documents
- Higher accuracy; used for reranking
Real-World Example
A 99helpers customer building a custom RAG system chooses a two-stage bi-encoder/cross-encoder architecture. They use a general-purpose bi-encoder (OpenAI text-embedding-3-small) for fast first-stage retrieval over 25,000 chunks, returning the top-20 candidates in 15ms. A lightweight cross-encoder reranker then scores those 20 candidates in 80ms, for a total retrieval latency of 95ms. By splitting the work between a fast bi-encoder and an accurate cross-encoder, they achieve both production-scale speed and high retrieval precision.
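The two-stage split can be sketched as below. The `rerank` scorer is a placeholder stand-in for a real cross-encoder, and all sizes and names are illustrative assumptions; the point is the shape of the pipeline, not the scoring math:

```python
import numpy as np

rng = np.random.default_rng(0)
N_DOCS, DIM = 25_000, 64

# Stage 1 (bi-encoder): cheap similarity over all pre-computed embeddings.
doc_embs = rng.standard_normal((N_DOCS, DIM))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

def first_stage(query_emb: np.ndarray, k: int = 20) -> np.ndarray:
    """Return indices of the top-k candidates by dot-product similarity."""
    return np.argsort(-(doc_embs @ query_emb))[:k]

def rerank(query_emb: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Stand-in for a cross-encoder: an expensive per-pair score applied
    only to the k candidates, never to all N_DOCS documents."""
    pair_scores = doc_embs[candidates] @ query_emb  # placeholder scoring
    return candidates[np.argsort(-pair_scores)]

query = doc_embs[123]            # pretend the query matches document 123
candidates = first_stage(query)  # 20 candidates out of 25,000 documents
final = rerank(query, candidates)  # reordered by the expensive scorer
```

The expensive scorer touches 20 documents instead of 25,000, which is why the two-stage design keeps total latency low while recovering most of the cross-encoder's precision on the final ranking.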
Common Mistakes
- ✕Confusing bi-encoders with cross-encoders — bi-encoders produce independent vectors enabling fast search; cross-encoders process pairs enabling accurate scoring; they serve different roles in the retrieval pipeline
- ✕Fine-tuning only the cross-encoder reranker without fine-tuning the bi-encoder — if the bi-encoder misses relevant documents in first-stage retrieval, the reranker cannot recover them
- ✕Using bi-encoders for final relevance scoring — bi-encoders are for scalable retrieval, not final precision scoring; use cross-encoders for the final relevance judgment
Related Terms
Cross-Encoder
A cross-encoder is a transformer model that processes a query and a document together in a single forward pass, producing a relevance score that captures fine-grained query-document interactions for high-quality reranking.
Embedding Model
An embedding model is a machine learning model that converts text (or other data) into dense numerical vectors that capture semantic meaning, enabling similarity search and serving as the foundation of RAG retrieval systems.
Dense Retrieval
Dense retrieval is a retrieval approach that encodes both queries and documents into dense embedding vectors and finds relevant documents by computing vector similarity, enabling semantic matching beyond exact keyword overlap.
Reranking
Reranking is a second-stage retrieval step that takes an initial set of candidate documents returned by a fast retrieval method and reorders them using a more accurate but computationally expensive model to improve final result quality.
Approximate Nearest Neighbor
Approximate Nearest Neighbor (ANN) search finds vectors that are close to a query vector with high probability but without guaranteeing exactness, enabling fast similarity search across millions of vectors at the cost of small accuracy tradeoffs.