Bag of Words
Definition
The bag-of-words (BoW) model represents a text document as an unordered collection of its words with their frequencies, discarding all positional and grammatical information. A vocabulary of V unique words produces a V-dimensional sparse vector for each document, where each dimension stores the word's count or TF-IDF weight. Despite its simplicity, BoW with TF-IDF weighting achieves competitive performance on many text classification and information retrieval tasks. It remains widely used in search engines, spam filters, and as a baseline for more complex models.
Why It Matters
Bag of words is the foundational building block for classical NLP systems, providing a numerically tractable way to represent text for machine learning before deep learning made sequence modeling practical. For applications requiring fast inference with limited compute—email routing, keyword filtering, log classification—BoW with logistic regression or naive Bayes still delivers strong results. Understanding BoW is essential background knowledge for anyone working with NLP, as it clarifies why modern embedding approaches were necessary and what problems they solve.
How It Works
BoW construction: (1) tokenize all documents and build a vocabulary of all unique tokens (optionally filtered by frequency and stop words); (2) represent each document as a V-dimensional vector, where V is the vocabulary size and position i holds the count of word i in that document. TF-IDF weighting replaces raw counts with term-frequency × inverse-document-frequency scores, downweighting words that appear in many documents and upweighting distinctive terms. Scikit-learn's CountVectorizer and TfidfVectorizer implement this pipeline. Bigram BoW extends the vocabulary to include two-word phrases.
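The two steps above can be sketched in plain Python. This is a minimal illustration that assumes whitespace tokenization and lowercasing; scikit-learn's CountVectorizer performs the same vocabulary-then-count pipeline with a more robust tokenizer and frequency filtering options.

```python
from collections import Counter

def build_vocab(docs):
    # Step 1: collect every unique token across the corpus (sorted for stable indices)
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def bow_vector(doc, vocab):
    # Step 2: map a document to a V-dimensional count vector
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

docs = ["the cat sat on the mat", "the dog sat by the door"]
vocab = build_vocab(docs)
vectors = [bow_vector(d, vocab) for d in docs]
print(vocab)    # ['by', 'cat', 'dog', 'door', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[0, 1, 0, 0, 1, 1, 1, 2], [1, 0, 1, 1, 0, 0, 1, 2]]
```

In practice these vectors are stored in sparse format, since most vocabulary words never appear in any given document.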
Bag-of-Words — Document to Vector Representation

Doc A: "the cat sat on the mat"
Doc B: "the dog sat by the door"

Combined vocabulary (8 unique words): [the, cat, sat, on, mat, dog, by, door]

Doc A vector: [2, 1, 1, 1, 1, 0, 0, 0] (8-dim sparse vector)
Doc B vector: [2, 0, 1, 0, 0, 1, 1, 1] (8-dim sparse vector)

Note: Word order is discarded — "cat sat" and "sat cat" produce identical vectors. This is the core limitation of BoW.
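TF-IDF reweighting of counts like these can be sketched as follows. This uses the classic idf = log(N / df) formula as an assumption for illustration; scikit-learn's TfidfVectorizer applies a smoothed variant by default, so its exact numbers differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # assumes whitespace tokenization; idf = log(N / df), the classic definition
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors

docs = ["the cat sat on the mat", "the dog sat by the door"]
vocab, vecs = tfidf_vectors(docs)
# "the" and "sat" occur in both documents, so idf = log(2/2) = 0
# and their weights vanish, leaving only the distinctive terms
```

This shows the downweighting effect directly: shared words like "the" get zero weight, while words unique to one document keep a positive score.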
Real-World Example
A legacy email routing system uses TF-IDF bag-of-words vectors with a multinomial naive Bayes classifier to route 10,000 daily support emails across 15 departments. The system achieves 89% routing accuracy on common email types and runs in under 1 millisecond per email on a single CPU core. When the team evaluated replacing it with a fine-tuned BERT classifier (94% accuracy), they found the 5% accuracy gain did not justify the 200x inference cost increase for this high-volume, low-stakes routing task.
Common Mistakes
- ✕ Ignoring the loss of word order — 'dog bites man' and 'man bites dog' are identical BoW representations
- ✕ Not applying TF-IDF weighting — raw word counts overweight common words that appear in every document
- ✕ Using BoW for tasks requiring semantic understanding — words with different surface forms but the same meaning (car/automobile) are treated as unrelated
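The first pitfall above is easy to verify in code: two sentences with opposite meanings map to exactly the same representation once order is discarded. A minimal check, assuming whitespace tokenization:

```python
from collections import Counter

def bow(doc):
    # order-free representation: only token counts survive
    return Counter(doc.lower().split())

# opposite meanings, identical bags of words
assert bow("dog bites man") == bow("man bites dog")
```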
Related Terms
N-gram
An n-gram is a contiguous sequence of n items—words, characters, or subwords—extracted from text, forming the building block of language models, search indexes, and text similarity algorithms.
Word Embeddings
Word embeddings are dense vector representations of words in a continuous numerical space where semantically similar words are positioned close together, enabling machines to understand word meaning through geometry.
Text Classification
Text classification automatically assigns predefined labels to text documents—such as topic, urgency, language, or intent—enabling large-scale categorization of unstructured content without manual review.
Stop Words
Stop words are high-frequency function words—such as 'the,' 'is,' 'at,' and 'which'—that are filtered out during text preprocessing to reduce noise and focus NLP models on content-bearing words.
Text Preprocessing
Text preprocessing is the collection of transformations applied to raw text before NLP model training or inference—including tokenization, normalization, and filtering—determining the quality and consistency of model inputs.