Retrieval-Augmented Generation (RAG)

TF-IDF

Definition

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical weighting scheme that quantifies how characteristic a term is of a specific document within a collection. It combines two components: TF (Term Frequency) — how often the term appears in the document (more occurrences = more important to that document), and IDF (Inverse Document Frequency) — how rarely the term appears across all documents (rare terms are more distinctive than common terms like 'the' or 'is'). TF-IDF score = TF(term, document) × IDF(term, collection). Documents are represented as TF-IDF vectors where each dimension corresponds to a vocabulary term, and these vectors serve as the sparse representations used for keyword matching.
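The definition above can be sketched in a few lines of Python. This is a minimal, from-scratch illustration of the TF × IDF product (the toy documents are made-up; a real system would use a library vectorizer):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Score one term for one document against a tokenized corpus."""
    # TF: relative frequency of the term within the document
    tf = Counter(doc)[term] / len(doc)
    # IDF: log of (total documents / documents containing the term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

docs = [
    "the chatbot answers support questions".split(),
    "the retrieval system finds documents".split(),
    "support tickets the the team handles".split(),
]

print(tf_idf("chatbot", docs[0], docs))  # rare across the corpus: positive weight
print(tf_idf("the", docs[0], docs))      # in every document: idf = log(1) = 0
```

Note how "the" gets a score of exactly zero here: it appears in every document, so its IDF vanishes regardless of how often it occurs in any one document.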

Why It Matters

TF-IDF is the conceptual predecessor to BM25 and the foundation for understanding how keyword search works. While modern RAG systems typically use BM25 (which improves on TF-IDF with better term frequency saturation and length normalization), TF-IDF remains widely used for text classification, document similarity, and as a simple baseline for retrieval. Understanding TF-IDF helps RAG practitioners reason about why certain terms are highly ranked in keyword search results — terms that appear frequently in a specific document but rarely across the corpus are highly distinctive signals for that document.

How It Works

TF-IDF computation: TF(t,d) is typically computed as count(t in d) / (total terms in d). IDF(t) = log(N / n_t), where N is the total number of documents and n_t is the number of documents containing t. TF-IDF(t,d) = TF(t,d) × IDF(t). For a collection of documents, TF-IDF vectors can be computed with scikit-learn's TfidfVectorizer. Documents and queries are both represented as TF-IDF vectors, and cosine similarity between them provides a relevance score. In practice, TF-IDF vectorizers are used for simple keyword retrieval, feature extraction for ML models, and document clustering, while BM25 is preferred for ranked retrieval due to its improved formula.
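The query-to-document flow described above can be sketched with scikit-learn's TfidfVectorizer, which the section mentions (the document texts are made-up placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the chatbot answers customer support questions",
    "dense retrieval finds documents by semantic similarity",
    "our support team handles billing questions",
]

vectorizer = TfidfVectorizer()               # defaults; real setups tune tokenization, n-grams, etc.
doc_vectors = vectorizer.fit_transform(docs) # sparse matrix: documents x vocabulary terms

# Represent the query in the same vector space, then rank by cosine similarity
query_vector = vectorizer.transform(["chatbot support"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
ranked = scores.argsort()[::-1]
print([docs[i] for i in ranked])             # most relevant document first
```

One caveat: scikit-learn applies a smoothed IDF (log((1 + N) / (1 + n_t)) + 1) and L2-normalizes rows by default, so its scores differ slightly from the textbook formula given above, though rankings are usually similar.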

TF-IDF Scoring

Term          TF      IDF     TF-IDF
"chatbot"     0.08    1.2     0.096
"support"     0.12    0.6     0.072
"retrieval"   0.04    2.8     0.112
"the"         0.15    0.05    0.008

High IDF = rare word = more distinctive. "the" scores near zero despite high frequency.

Real-World Example

A 99helpers customer uses TF-IDF analysis on their knowledge base to identify the most distinctive terms in each article — the terms that uniquely characterize each article relative to the rest of the knowledge base. These high-TF-IDF terms serve as automatic tags for each article, improving the metadata available for filtering and helping content managers identify what each article is distinctively about. Articles with no high-TF-IDF terms are flagged for review, as they may contain only common terms that make them difficult to retrieve with keyword search.

Common Mistakes

  • Using TF-IDF for retrieval when BM25 is available — BM25 addresses known weaknesses in TF-IDF (unbounded term frequency, no length normalization) and consistently outperforms it for retrieval tasks
  • Applying TF-IDF without removing stop words — common words like 'the', 'is', and 'a' artificially inflate TF scores; always remove stop words before computing TF-IDF
  • Treating TF-IDF vectors as equivalent to embedding vectors — TF-IDF vectors are sparse and capture vocabulary statistics; embedding vectors are dense and capture semantic meaning; they serve different purposes
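The stop-word point above is easy to check in practice. TfidfVectorizer accepts a built-in English stop list via `stop_words="english"` (a minimal comparison on made-up sentences):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the chatbot is the best", "the retrieval step is slow"]

default = TfidfVectorizer().fit(docs)
filtered = TfidfVectorizer(stop_words="english").fit(docs)

print(sorted(default.vocabulary_))   # includes 'the', 'is'
print(sorted(filtered.vocabulary_))  # stop words removed from the vocabulary
```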
