TF-IDF
Definition
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical weighting scheme that quantifies how characteristic a term is of a specific document within a collection. It combines two components: TF (Term Frequency) — how often the term appears in the document (more occurrences = more important to that document), and IDF (Inverse Document Frequency) — how rarely the term appears across all documents (rare terms are more distinctive than common terms like 'the' or 'is'). TF-IDF score = TF(term, document) × IDF(term, collection). Documents are represented as TF-IDF vectors where each dimension corresponds to a vocabulary term, and these vectors serve as the sparse representations used for keyword matching.
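The definition above can be sketched from scratch in a few lines. This is a minimal illustration using the plain (unsmoothed) formulas given here; the corpus sentences are invented for the example:

```python
import math

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by document length.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Inverse document frequency: log(total docs / docs containing the term).
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus (illustrative sentences, not real data).
corpus = [
    "the chatbot answers support questions".split(),
    "the retrieval system ranks documents".split(),
    "support agents review the chatbot logs".split(),
]

# "the" appears in every document, so IDF = log(3/3) = 0 and its score is 0.
print(tf_idf("the", corpus[0], corpus))        # 0.0
# "retrieval" appears in only one document, so it gets a positive score.
print(tf_idf("retrieval", corpus[1], corpus))
```

Note how the ubiquitous term is zeroed out entirely by the IDF factor, while the rare term keeps a positive weight.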
Why It Matters
TF-IDF is the conceptual predecessor to BM25 and the foundation for understanding how keyword search works. While modern RAG systems typically use BM25 (which improves on TF-IDF with better term frequency saturation and length normalization), TF-IDF remains widely used for text classification, document similarity, and as a simple baseline for retrieval. Understanding TF-IDF helps RAG practitioners reason about why certain terms are highly ranked in keyword search results — terms that appear frequently in a specific document but rarely across the corpus are highly distinctive signals for that document.
How It Works
TF-IDF computation: TF(t,d) is typically computed as count(t in d) / (total terms in d), and IDF(t) = log(total documents / documents containing t), giving TF-IDF(t,d) = TF(t,d) × IDF(t). Practical implementations usually smooth the IDF term to avoid division by zero and zero weights; scikit-learn's TfidfVectorizer, for example, defaults to log((1 + N) / (1 + df)) + 1. Documents and queries are both represented as TF-IDF vectors, and cosine similarity between them provides a relevance score. In practice, TF-IDF vectorizers are used for simple keyword retrieval, feature extraction for ML models, and document clustering, while BM25 is preferred for ranked retrieval due to its improved formula.
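The TfidfVectorizer workflow described above looks like this in practice. A minimal sketch, assuming scikit-learn is installed; the document texts and query are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "How do I reset my chatbot password?",
    "The retrieval system ranks support articles.",
    "Billing and subscription support options.",
]

# Fit the vectorizer on the collection; each document becomes a sparse
# TF-IDF vector over the learned vocabulary.
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)

# The query is projected into the SAME vocabulary space with transform().
query_vec = vectorizer.transform(["reset chatbot password"])

# Cosine similarity between query and documents gives relevance scores.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
ranked = scores.argsort()[::-1]
for idx in ranked:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```

The first document shares all three query terms and ranks first; the other two share none (after stop-word removal) and score zero.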
TF-IDF Scoring
| Term        | TF   | IDF  | TF-IDF |
|-------------|------|------|--------|
| "chatbot"   | 0.08 | 1.2  | 0.096  |
| "support"   | 0.12 | 0.6  | 0.072  |
| "retrieval" | 0.04 | 2.8  | 0.112  |
| "the"       | 0.15 | 0.05 | 0.008  |
High IDF = rare word = more distinctive. "the" scores near zero despite high frequency.
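Each TF-IDF entry in the table is simply the product of its row's TF and IDF columns, which is quick to verify (the TF/IDF figures are the illustrative values from the table, not computed from a real corpus):

```python
# (TF, IDF) pairs taken from the illustrative table above.
rows = {
    "chatbot":   (0.08, 1.2),
    "support":   (0.12, 0.6),
    "retrieval": (0.04, 2.8),
    "the":       (0.15, 0.05),  # 0.0075; the table rounds this to 0.008
}

for term, (tf_val, idf_val) in rows.items():
    print(f"{term:10} TF-IDF = {tf_val * idf_val:.4f}")
```

Note that "retrieval" ends up with the highest score despite the lowest TF, purely because its IDF marks it as rare and distinctive.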
Real-World Example
A 99helpers customer uses TF-IDF analysis on their knowledge base to identify the most distinctive terms in each article — the terms that uniquely characterize each article relative to the rest of the knowledge base. These high-TF-IDF terms serve as automatic tags for each article, improving the metadata available for filtering and helping content managers identify what each article is distinctively about. Articles with no high-TF-IDF terms are flagged for review, as they may contain only common terms that make them difficult to retrieve with keyword search.
Common Mistakes
- ✕Using TF-IDF for retrieval when BM25 is available — BM25 addresses known weaknesses in TF-IDF (unbounded term frequency, no length normalization) and consistently outperforms it for retrieval tasks
- ✕Applying TF-IDF without removing stop words — common words like 'the', 'is', and 'a' receive near-zero IDF weights, but they still bloat the vocabulary and add noise to the vectors; remove stop words before computing TF-IDF to keep representations compact and scores cleaner
- ✕Treating TF-IDF vectors as equivalent to embedding vectors — TF-IDF vectors are sparse and capture vocabulary statistics; embedding vectors are dense and capture semantic meaning; they serve different purposes
Related Terms
BM25
BM25 (Best Match 25) is the industry-standard sparse retrieval algorithm that scores documents against a query based on term frequency, inverse document frequency, and document length normalization, widely used in search engines and hybrid RAG systems.
Sparse Retrieval
Sparse retrieval is a search approach based on exact or weighted keyword matching, where documents and queries are represented as high-dimensional sparse vectors with most values being zero, and similarity is measured by term overlap.
Inverted Index
An inverted index is a data structure that maps each unique term in a document collection to the list of documents containing that term, enabling fast full-text keyword search and powering BM25 and other sparse retrieval algorithms.
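The term-to-documents mapping an inverted index provides can be sketched with a plain dictionary (a toy illustration with invented documents, not a production index, which would also store positions and frequencies):

```python
from collections import defaultdict

# Toy document collection (illustrative).
docs = {
    0: "the chatbot answers support questions",
    1: "retrieval ranks the support articles",
}

# Build the inverted index: each term maps to the set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# A multi-term lookup intersects posting lists instead of scanning every document.
def search(*terms):
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(search("support"))             # both documents contain "support"
print(search("support", "chatbot"))  # only document 0 contains both
```

This lookup-by-term structure is what lets BM25 and other sparse retrieval algorithms score only the documents that actually contain a query term.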
Hybrid Retrieval
Hybrid retrieval combines dense (semantic) and sparse (keyword) search methods to leverage the strengths of both, using a fusion step to merge their results into a single ranked list for better overall retrieval quality.
Dense Retrieval
Dense retrieval is a retrieval approach that encodes both queries and documents into dense embedding vectors and finds relevant documents by computing vector similarity, enabling semantic matching beyond exact keyword overlap.