Natural Language Processing (NLP)

Bag of Words

Definition

The bag-of-words (BoW) model represents a text document as an unordered collection of its words with their frequencies, discarding all positional and grammatical information. A vocabulary of V unique words produces a V-dimensional sparse vector for each document, where each dimension stores the word's count or TF-IDF weight. Despite its simplicity, BoW with TF-IDF weighting achieves competitive performance on many text classification and information retrieval tasks. It remains widely used in search engines, spam filters, and as a baseline for more complex models.

Why It Matters

Bag of words is the foundational building block for classical NLP systems, providing a numerically tractable way to represent text for machine learning before deep learning made sequence modeling practical. For applications requiring fast inference with limited compute—email routing, keyword filtering, log classification—BoW with logistic regression or naive Bayes still delivers strong results. Understanding BoW is essential background knowledge for anyone working with NLP, as it clarifies why modern embedding approaches were necessary and what problems they solve.

How It Works

BoW construction: (1) tokenize all documents and build a vocabulary of V unique tokens (optionally filtered by frequency and stop words); (2) represent each document as a V-dimensional vector where position i holds the frequency of word i in that document. TF-IDF weighting replaces raw counts with term-frequency × inverse-document-frequency scores, downweighting words that appear in many documents and upweighting distinctive terms. Scikit-learn's CountVectorizer and TfidfVectorizer implement this pipeline. Bigram BoW extends the vocabulary to include two-word phrases.
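A minimal sketch of these steps with scikit-learn, using two toy documents (the variable names are illustrative, not from any particular system):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat by the door"]

# Steps 1-2: build the vocabulary and the per-document count vectors
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)   # sparse matrix, one row per document
print(sorted(count_vec.vocabulary_))       # the 8 unique tokens

# TF-IDF weighting: replace raw counts with tf x idf scores
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)    # same shape, reweighted values

# Bigram BoW: the vocabulary also includes two-word phrases
bigram_vec = CountVectorizer(ngram_range=(1, 2))
X_bigrams = bigram_vec.fit_transform(docs)
print(X_bigrams.shape)                     # 8 unigrams + 10 bigrams = 18 columns
```

Note that CountVectorizer orders its vocabulary alphabetically by default, so column positions differ from insertion order.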

Bag-of-Words — Document to Vector Representation

Doc A

"the cat sat on the mat"

Doc B

"the dog sat by the door"

tokenize + count
Word    Index   Doc A   Doc B
the       0       2       2
cat       1       1       0
sat       2       1       1
on        3       1       0
mat       4       1       0
dog       5       0       1
by        6       0       1
door      7       0       1

Doc A vector

[2, 1, 1, 1, 1, 0, 0, 0]

8-dim sparse vector

Doc B vector

[2, 0, 1, 0, 0, 1, 1, 1]

8-dim sparse vector

Note: Word order is discarded — "cat sat" and "sat cat" produce identical vectors. This is the core limitation of BoW.
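This order-insensitivity is easy to verify in a few lines of Python; the sketch below uses a plain Counter as a stand-in for a count vector:

```python
from collections import Counter

def bow(text):
    """Minimal bag of words: maps each token to its count; order is discarded."""
    return Counter(text.lower().split())

# Different word order, identical representation
assert bow("the cat sat") == bow("sat the cat")

# Counts for Doc A: "the" appears twice, every other word once
doc_a = bow("the cat sat on the mat")
assert doc_a["the"] == 2 and doc_a["cat"] == 1
```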

Real-World Example

A legacy email routing system uses TF-IDF bag-of-words vectors with a multinomial naive Bayes classifier to route 10,000 daily support emails across 15 departments. The system achieves 89% routing accuracy on common email types and runs in under 1 millisecond per email on a single CPU core. When the team evaluated replacing it with a fine-tuned BERT classifier (94% accuracy), they found the 5% accuracy gain did not justify the 200x inference cost increase for this high-volume, low-stakes routing task.
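A routing setup like this can be sketched as a scikit-learn pipeline. The emails and department labels below are invented toy data, not the system's actual corpus; a production system would train on thousands of historical labeled tickets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical labeled training emails (toy data for illustration)
emails = [
    "invoice overdue please send payment",
    "cannot log in need a password reset",
    "requesting a refund for a duplicate charge",
    "account locked after too many failed logins",
]
departments = ["billing", "support", "billing", "support"]

router = Pipeline([
    ("tfidf", TfidfVectorizer()),   # BoW counts reweighted by TF-IDF
    ("clf", MultinomialNB()),       # fast, CPU-friendly classifier
])
router.fit(emails, departments)

print(router.predict(["password reset not working"])[0])
```

Inference here is a sparse dot product plus a log-probability sum, which is why this class of model routes emails in well under a millisecond on a single core.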

Common Mistakes

  • Ignoring the loss of word order—"dog bites man" and "man bites dog" have identical BoW representations
  • Not applying TF-IDF weighting—raw word counts overweight common words that appear in every document
  • Using BoW for tasks requiring semantic understanding—words with different surface forms but same meaning (car/automobile) are treated as unrelated
