Retrieval-Augmented Generation (RAG)

BM25

Definition

BM25 (Best Match 25) is a probabilistic ranking function for information retrieval that scores how relevant a document is to a query by computing a weighted sum of scores for each query term found in the document. The formula combines three signals: term frequency with saturation (the first occurrence of a query term contributes more than subsequent ones, preventing documents from gaming the score by repeating terms), inverse document frequency (terms that are rare across the corpus contribute more to relevance than common ones), and length normalization (a query term appearing in a short document signals more relevance than the same term in a long document). BM25 is the standard baseline for keyword search in Elasticsearch, Solr, Lucene, and hybrid vector search systems.

Why It Matters

BM25 remains one of the most effective retrieval algorithms despite being roughly 30 years old. It stays relevant because it is well suited to a common real-world pattern: users query with specific technical terms, product names, version numbers, and error codes that should be matched exactly. Dense semantic search cannot reliably match these, because embedding models treat such identifiers as ordinary tokens; the embedding for 'NullPointerException' may be less distinctive than an exact keyword match. Modern RAG best practices treat BM25 as the indispensable sparse component in hybrid retrieval systems.

How It Works

BM25 scoring for a query Q with terms q1, q2, ..., qn against a document d is:

score(d, Q) = sum over query terms qi of [ IDF(qi) × tf(qi, d) × (k1 + 1) / (tf(qi, d) + k1 × (1 − b + b × |d| / avgdl)) ]

Parameters:

  • k1 controls term frequency saturation (typically 1.2–2.0)
  • b controls length normalization (typically 0.75)
  • |d| is the document length; avgdl is the average document length across the corpus

In practice, BM25 is implemented in Elasticsearch (or OpenSearch) for the sparse retrieval component of hybrid RAG systems, or in lightweight Python libraries like rank_bm25 for smaller collections.
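The formula above can be sketched directly in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes pre-tokenized text, uses k1 = 1.5 and b = 0.75, and uses the smoothed Lucene-style IDF, ln((N − n + 0.5) / (n + 0.5) + 1), since the article's formula leaves the exact IDF variant unspecified.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with the BM25 formula.

    `corpus` is a list of tokenized documents (lists of terms), used to
    derive IDF and the average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)           # docs containing term
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)     # smoothed, Lucene-style IDF
        tf = doc_terms.count(term)
        norm = k1 * (1 - b + b * len(doc_terms) / avgdl)  # length normalization
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score

corpus = [
    "reset password by clicking the reset link".split(),
    "contact support to change your billing plan".split(),
]
query = "reset password".split()
print(bm25_score(query, corpus[0], corpus))  # positive score
print(bm25_score(query, corpus[1], corpus))  # 0.0: no query term present
```

Note that a term absent from a document contributes exactly zero, which is why BM25 alone cannot handle vocabulary mismatch; that gap is what the dense component of a hybrid system covers.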

BM25 Scoring — Query: "reset password"

Score components:

  • TF: term frequency in the document
  • IDF: term rarity across the corpus
  • Length norm: penalizes long documents

  • Doc 1 (score 4.2): "Reset password by clicking the reset link." (TF: 3, doc length: short)
  • Doc 2 (score 1.8): "Our platform provides account management including password changes, profile updates, billing, notifications, and security settings." (TF: 1, doc length: long)
  • Doc 3 (score 3.1): "To reset your password, visit account settings and confirm your email." (TF: 2, doc length: medium)

Ranked Results:

  1. Doc 1 (4.2)
  2. Doc 3 (3.1)
  3. Doc 2 (1.8)
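The ranking above can be reproduced with a small self-contained scorer. This sketch assumes k1 = 1.5, b = 0.75, and a smoothed Lucene-style IDF, so the absolute scores differ from the illustrative 4.2 / 3.1 / 1.8 figures, but the ordering comes out the same: the short document with TF 3 wins, the long document with TF 1 loses.

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against `query` with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    scores = []
    for doc in docs:
        s = 0.0
        for term in query:
            n = sum(1 for d in docs if term in d)          # docs containing term
            idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # smoothed IDF
            tf = doc.count(term)
            norm = k1 * (1 - b + b * len(doc) / avgdl)     # length normalization
            s += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(s)
    return scores

docs = [
    "reset password by clicking the reset link",           # Doc 1: TF 3, short
    "our platform provides account management including "
    "password changes profile updates billing notifications "
    "and security settings",                               # Doc 2: TF 1, long
    "to reset your password visit account settings and "
    "confirm your email",                                  # Doc 3: TF 2, medium
]
tokenized = [d.split() for d in docs]
scores = bm25_scores("reset password".split(), tokenized)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(["Doc %d" % (i + 1) for i in ranking])  # ['Doc 1', 'Doc 3', 'Doc 2']
```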

Real-World Example

A 99helpers customer building their RAG evaluation discovers that 23% of their most common customer queries include specific product identifiers, version numbers, or error codes. For these queries, dense retrieval recall@5 is only 61% because the embedding model does not strongly differentiate specific identifiers. After adding BM25 as the sparse component in a hybrid system, recall@5 for identifier queries improves to 94%. Overall hybrid system recall@5 across all query types improves from 82% to 91%.

Common Mistakes

  • Treating BM25 as inferior to dense retrieval — BM25 is state-of-the-art for many query types and an essential component of production RAG systems
  • Not tuning BM25 parameters (k1, b) for your document collection — default parameters may not be optimal for your specific document length distribution and query style
  • Applying BM25 to raw text without preprocessing — apply tokenization, lowercasing, stop word removal, and stemming to improve BM25 retrieval quality
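The preprocessing step from the last bullet can be as simple as the sketch below: lowercasing, tokenizing on non-alphanumeric characters, and dropping stop words. The stop word list here is a tiny illustrative one, and stemming is omitted; production pipelines typically use their search library's analyzers or a stemmer such as NLTK's PorterStemmer.

```python
import re

# Tiny illustrative stop word list; real systems use a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "your", "by"}

def preprocess(text):
    """Lowercase, tokenize, and drop stop words before BM25 indexing."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("To reset your password, visit Account Settings."))
# ['reset', 'password', 'visit', 'account', 'settings']
```

The same function must be applied to both documents at index time and queries at search time, or the tokens will not match.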

