Stop Words
Definition
Stop words are common words that carry little semantic content individually but are structurally necessary in natural language. Standard stop word lists for English include articles (a, an, the), prepositions (in, on, at, to), conjunctions (and, but, or), pronouns (I, you, it), and auxiliary verbs (is, was, have). Their removal reduces vocabulary size, speeds up training, and focuses models on semantically meaningful tokens. However, stop word removal is context-dependent—'not' changes meaning fundamentally ('I am happy' vs. 'I am not happy'), and 'will' is both an auxiliary verb on most stop word lists and, capitalized, the proper noun Will, so naive case-insensitive filtering can discard names.
Why It Matters
Stop word filtering is a foundational preprocessing step that improves TF-IDF search precision and reduces model training time. By removing function words that appear in almost every document, TF-IDF scores more accurately reflect term distinctiveness. For sparse vector models and keyword search systems, stop word removal is essential for performance. However, modern transformer models are better served without stop word removal, as they learn contextual importance automatically and function words contribute to grammatical understanding.
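The TF-IDF effect described above can be seen directly in the IDF term: a word that appears in every document gets an inverse document frequency of zero, so it contributes nothing to distinctiveness. A minimal sketch, using a hypothetical three-title help-center corpus:

```python
import math

# Toy corpus (hypothetical help-center titles) illustrating why stop
# words hurt TF-IDF: a term present in every document has IDF = 0.
docs = [
    "how to reset your password",
    "how to change your email",
    "how to delete your account",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def idf(term):
    # Document frequency: number of documents containing the term.
    df = sum(term in doc for doc in tokenized)
    return math.log(n_docs / df)

print(idf("how"))       # in all 3 docs -> log(3/3) = 0.0
print(idf("password"))  # in 1 doc      -> log(3/1) ≈ 1.10
```

Because 'how', 'to', and 'your' appear in every title, their IDF is exactly zero here; removing them up front simply avoids storing and multiplying terms that can never raise a relevance score.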
How It Works
Stop word removal is typically implemented as a set membership check: for each token in the input, check whether it exists in a stop word list and exclude it if so. NLTK, spaCy, and scikit-learn provide built-in stop word lists for many languages. Custom lists extend these with domain-specific filler terms (e.g., 'please,' 'thanks,' 'hello' in customer support text). Order-sensitive applications sometimes use 'soft' stop word filtering via TF-IDF downweighting rather than hard removal to preserve sequence information.
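The set membership check above can be sketched in a few lines. The stop word set here is a small illustrative subset, not a production list (NLTK's English list has on the order of 179 entries, spaCy's over 300):

```python
# Minimal sketch of hard stop word removal via set membership.
# STOP_WORDS is an illustrative subset, not a full English list.
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "to", "and", "but",
              "or", "is", "was", "have", "i", "you", "it", "your", "how"}

def remove_stop_words(tokens):
    # Lowercase before the membership check so "How" matches "how".
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "How to reset your password".split()
print(remove_stop_words(tokens))  # ['reset', 'password']
```

Using a set rather than a list makes each lookup O(1), so filtering stays linear in the number of tokens.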
[Interactive demo: Stop Words — Before & After Removal — shows an input sentence before removal, after removal, and a list of common stop words.]
Real-World Example
A help center search engine using BM25 indexing applies stop word filtering before indexing. Articles about 'how to reset your password' are indexed as {reset, password} after stop word removal, which dramatically improves retrieval precision—without filtering, common words like 'how,' 'to,' 'your' would dominate the index and dilute relevance scores. The filtered index achieves a mean reciprocal rank of 0.87 vs. 0.71 without filtering.
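A hypothetical sketch of that indexing step: filter stop words before terms enter an inverted index, so function words never appear in the postings lists at all (the stop word set and documents are illustrative):

```python
# Filter stop words at index time so frequent function words
# never enter the inverted index's postings lists.
STOP_WORDS = {"how", "to", "your", "a", "the", "my", "i", "in", "on"}

def index_documents(docs):
    inverted = {}  # term -> set of document ids
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            if term not in STOP_WORDS:
                inverted.setdefault(term, set()).add(doc_id)
    return inverted

index = index_documents(["How to reset your password",
                         "How to update your profile"])
print(sorted(index))      # ['password', 'profile', 'reset', 'update']
print(index["password"])  # {0}
```

The article 'how to reset your password' is indexed under only {reset, password}, matching the behavior described above.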
Common Mistakes
- ✕Removing 'not,' 'no,' and 'never' as stop words—negations fundamentally change meaning and should be retained
- ✕Using a generic stop word list without domain customization—'free' is a stop word in many lists but is highly meaningful in pricing contexts
- ✕Applying stop word removal to transformer model inputs—transformers don't benefit and may be harmed by removing function words
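The first two mistakes share a fix: start from a generic list, then subtract negations and domain-critical terms before filtering. A sketch with an illustrative base list (real lists are far longer):

```python
# Customize a generic stop word list: retain negations and
# domain-critical terms. BASE_STOP_WORDS is an illustrative subset.
BASE_STOP_WORDS = {"a", "the", "is", "not", "no", "never", "to", "free"}

NEGATIONS = {"not", "no", "never"}
DOMAIN_KEEP = {"free"}  # highly meaningful in pricing contexts

custom_stop_words = BASE_STOP_WORDS - NEGATIONS - DOMAIN_KEEP

tokens = "the trial is not free".split()
print([t for t in tokens if t not in custom_stop_words])
# ['trial', 'not', 'free'] -- negation and pricing term survive
```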
Related Terms
Stemming
Stemming reduces words to their root form by stripping suffixes—converting 'running' and 'runs' to 'run'—enabling search and retrieval systems to match documents regardless of word inflection; irregular forms such as 'ran' typically escape suffix rules and require lemmatization instead.
Lemmatization
Lemmatization reduces words to their dictionary base form—their lemma—using morphological analysis and vocabulary lookups, producing linguistically correct roots that improve NLP model accuracy compared to stemming.
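The difference between the two terms above can be sketched with a naive suffix-stripping stemmer versus a dictionary-lookup lemmatizer. Both the suffix rules and the lemma table below are illustrative toys, far smaller than what NLTK or spaCy ship:

```python
# Naive suffix-stripping stemmer vs. dictionary-lookup lemmatizer.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip if enough of a stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Illustrative lemma table; real lemmatizers use full morphological
# dictionaries plus part-of-speech information.
LEMMAS = {"ran": "run", "running": "run", "better": "good"}

def naive_lemmatize(word):
    return LEMMAS.get(word, word)

print(naive_stem("jumping"))   # 'jump'
print(naive_stem("ran"))       # 'ran' -- suffix rules miss irregular forms
print(naive_lemmatize("ran"))  # 'run' -- dictionary lookup handles them
```

The irregular form 'ran' shows the trade-off: the stemmer's string rules cannot reach 'run', while the lemmatizer's vocabulary lookup can.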
Text Preprocessing
Text preprocessing is the collection of transformations applied to raw text before NLP model training or inference—including tokenization, normalization, and filtering—determining the quality and consistency of model inputs.
Bag of Words
Bag of words is a text representation model that describes documents by their word frequencies, ignoring grammar and word order, producing fixed-length vectors suitable for classical machine learning algorithms.
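A minimal bag-of-words sketch using a toy two-document corpus: fix a vocabulary, then represent each document as a fixed-length vector of word counts in vocabulary order:

```python
from collections import Counter

# Toy corpus; vocabulary is the sorted set of all words seen.
docs = ["the cat sat", "the cat sat on the mat"]
vocab = sorted({w for d in docs for w in d.split()})
# vocab == ['cat', 'mat', 'on', 'sat', 'the']

def bow_vector(text):
    # Count word occurrences, then read them out in vocabulary order.
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

print(bow_vector("the cat sat on the mat"))  # [1, 1, 1, 1, 2]
```

Word order is discarded by construction: 'cat sat the' would produce the same vector as 'the cat sat'.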
N-gram
An n-gram is a contiguous sequence of n items—words, characters, or subwords—extracted from text, forming the building block of language models, search indexes, and text similarity algorithms.
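Word-level n-gram extraction is a sliding window over the token sequence, as in this short sketch:

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "reset your password".split()
print(ngrams(tokens, 2))  # [('reset', 'your'), ('your', 'password')]
print(ngrams(tokens, 1))  # unigrams: [('reset',), ('your',), ('password',)]
```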