Stop Words
Definition
Stop words are common words that carry little semantic content individually but are structurally necessary in natural language. Standard stop word lists for English include articles (a, an, the), prepositions (in, on, at, to), conjunctions (and, but, or), pronouns (I, you, it), and auxiliary verbs (is, was, have). Their removal reduces vocabulary size, speeds up training, and focuses models on semantically meaningful tokens. However, stop word removal is context-dependent—'not' changes meaning fundamentally ('I am happy' vs. 'I am not happy'), and 'will' is both an auxiliary verb on most stop word lists and, capitalized, the proper noun Will, so naive case-insensitive filtering can discard names.
Why It Matters
Stop word filtering is a foundational preprocessing step that improves TF-IDF search precision and reduces model training time. By removing function words that appear in almost every document, TF-IDF scores more accurately reflect term distinctiveness. For sparse vector models and keyword search systems, stop word removal is essential for performance. However, modern transformer models are better served without stop word removal, as they learn contextual importance automatically and function words contribute to grammatical understanding.
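The TF-IDF effect described above can be seen directly in the IDF term: a word that appears in every document gets an inverse document frequency of zero, so it contributes nothing to distinctiveness. A minimal sketch, using a hypothetical three-title help-center corpus:

```python
import math

# Toy corpus (hypothetical help-center titles) illustrating why stop
# words hurt TF-IDF: a term present in every document has IDF = 0.
docs = [
    "how to reset your password",
    "how to change your email",
    "how to delete your account",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def idf(term):
    # Document frequency: number of documents containing the term.
    df = sum(term in doc for doc in tokenized)
    return math.log(n_docs / df)

print(idf("how"))       # in all 3 docs -> log(3/3) = 0.0
print(idf("password"))  # in 1 doc      -> log(3/1) ≈ 1.10
```

Because 'how', 'to', and 'your' appear in every title, their IDF is exactly zero here; removing them up front simply avoids storing and multiplying terms that can never raise a relevance score.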
How It Works
Stop word removal is typically implemented as a set membership check: for each token in the input, check whether it exists in a stop word list and exclude it if so. NLTK, spaCy, and scikit-learn provide built-in stop word lists for many languages. Custom lists extend these with domain-specific filler terms (e.g., 'please,' 'thanks,' 'hello' in customer support text). Order-sensitive applications sometimes use 'soft' stop word filtering via TF-IDF downweighting rather than hard removal to preserve sequence information.
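The set membership check above can be sketched in a few lines. The stop word set here is a small illustrative subset, not a production list (NLTK's English list has on the order of 179 entries, spaCy's over 300):

```python
# Minimal sketch of hard stop word removal via set membership.
# STOP_WORDS is an illustrative subset, not a full English list.
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "to", "and", "but",
              "or", "is", "was", "have", "i", "you", "it", "your", "how"}

def remove_stop_words(tokens):
    # Lowercase before the membership check so "How" matches "how".
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "How to reset your password".split()
print(remove_stop_words(tokens))  # ['reset', 'password']
```

Using a set rather than a list makes each lookup O(1), so filtering stays linear in the number of tokens.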
[Interactive demo: Stop Words — Before & After Removal — shows an input sentence before removal, after removal, and a list of common stop words.]
Real-World Example
A help center search engine using BM25 indexing applies stop word filtering before indexing. Articles about 'how to reset your password' are indexed as {reset, password} after stop word removal, which dramatically improves retrieval precision—without filtering, common words like 'how,' 'to,' 'your' would dominate the index and dilute relevance scores. The filtered index achieves a mean reciprocal rank of 0.87 vs. 0.71 without filtering.
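A hypothetical sketch of that indexing step: filter stop words before terms enter an inverted index, so function words never appear in the postings lists at all (the stop word set and documents are illustrative):

```python
# Filter stop words at index time so frequent function words
# never enter the inverted index's postings lists.
STOP_WORDS = {"how", "to", "your", "a", "the", "my", "i", "in", "on"}

def index_documents(docs):
    inverted = {}  # term -> set of document ids
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            if term not in STOP_WORDS:
                inverted.setdefault(term, set()).add(doc_id)
    return inverted

index = index_documents(["How to reset your password",
                         "How to update your profile"])
print(sorted(index))      # ['password', 'profile', 'reset', 'update']
print(index["password"])  # {0}
```

The article 'how to reset your password' is indexed under only {reset, password}, matching the behavior described above.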
Common Mistakes
- ✕Removing 'not,' 'no,' and 'never' as stop words—negations fundamentally change meaning and should be retained
- ✕Using a generic stop word list without domain customization—'free' is a stop word in many lists but is highly meaningful in pricing contexts
- ✕Applying stop word removal to transformer model inputs—transformers don't benefit and may be harmed by removing function words
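The first two mistakes share a fix: start from a generic list, then subtract negations and domain-critical terms before filtering. A sketch with an illustrative base list (real lists are far longer):

```python
# Customize a generic stop word list: retain negations and
# domain-critical terms. BASE_STOP_WORDS is an illustrative subset.
BASE_STOP_WORDS = {"a", "the", "is", "not", "no", "never", "to", "free"}

NEGATIONS = {"not", "no", "never"}
DOMAIN_KEEP = {"free"}  # highly meaningful in pricing contexts

custom_stop_words = BASE_STOP_WORDS - NEGATIONS - DOMAIN_KEEP

tokens = "the trial is not free".split()
print([t for t in tokens if t not in custom_stop_words])
# ['trial', 'not', 'free'] -- negation and pricing term survive
```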
Related Terms
Stemming
Stemming reduces words to their root form by stripping suffixes—converting 'running' and 'runs' to 'run'—enabling search and retrieval systems to match documents regardless of word inflection; irregular forms such as 'ran' typically escape suffix rules and require lemmatization instead.
Lemmatization
Lemmatization reduces words to their dictionary base form—their lemma—using morphological analysis and vocabulary lookups, producing linguistically correct roots that improve NLP model accuracy compared to stemming.
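The difference between the two terms above can be sketched with a naive suffix-stripping stemmer versus a dictionary-lookup lemmatizer. Both the suffix rules and the lemma table below are illustrative toys, far smaller than what NLTK or spaCy ship:

```python
# Naive suffix-stripping stemmer vs. dictionary-lookup lemmatizer.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip if enough of a stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Illustrative lemma table; real lemmatizers use full morphological
# dictionaries plus part-of-speech information.
LEMMAS = {"ran": "run", "running": "run", "better": "good"}

def naive_lemmatize(word):
    return LEMMAS.get(word, word)

print(naive_stem("jumping"))   # 'jump'
print(naive_stem("ran"))       # 'ran' -- suffix rules miss irregular forms
print(naive_lemmatize("ran"))  # 'run' -- dictionary lookup handles them
```

The irregular form 'ran' shows the trade-off: the stemmer's string rules cannot reach 'run', while the lemmatizer's vocabulary lookup can.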
Text Preprocessing
Text preprocessing is the collection of transformations applied to raw text before NLP model training or inference—including tokenization, normalization, and filtering—determining the quality and consistency of model inputs.
Bag of Words
Bag of words is a text representation model that describes documents by their word frequencies, ignoring grammar and word order, producing fixed-length vectors suitable for classical machine learning algorithms.
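A minimal bag-of-words sketch using a toy two-document corpus: fix a vocabulary, then represent each document as a fixed-length vector of word counts in vocabulary order:

```python
from collections import Counter

# Toy corpus; vocabulary is the sorted set of all words seen.
docs = ["the cat sat", "the cat sat on the mat"]
vocab = sorted({w for d in docs for w in d.split()})
# vocab == ['cat', 'mat', 'on', 'sat', 'the']

def bow_vector(text):
    # Count word occurrences, then read them out in vocabulary order.
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

print(bow_vector("the cat sat on the mat"))  # [1, 1, 1, 1, 2]
```

Word order is discarded by construction: 'cat sat the' would produce the same vector as 'the cat sat'.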
N-gram
An n-gram is a contiguous sequence of n items—words, characters, or subwords—extracted from text, forming the building block of language models, search indexes, and text similarity algorithms.
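Word-level n-gram extraction is a sliding window over the token sequence, as in this short sketch:

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "reset your password".split()
print(ngrams(tokens, 2))  # [('reset', 'your'), ('your', 'password')]
print(ngrams(tokens, 1))  # unigrams: [('reset',), ('your',), ('password',)]
```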