N-gram
Definition
An n-gram is a sequence of n consecutive tokens from a text stream. Unigrams are single tokens (n=1); bigrams are pairs (n=2); trigrams are triples (n=3). N-gram language models estimate the probability of each token given its preceding n-1 tokens, using maximum likelihood counts from large corpora, typically combined with a smoothing method such as Kneser-Ney. Character n-grams are used for spell checking, language detection, and subword representations. N-gram models were the dominant approach to language modeling before neural networks and still power search engines, spell checkers, and text fingerprinting systems.
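The sliding-window extraction behind all three orders can be sketched in a few lines of Python (the `ngrams` helper is illustrative, not from any particular library):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list, yielding tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
unigrams = ngrams(tokens, 1)  # [('the',), ('quick',), ('brown',), ('fox',)]
bigrams = ngrams(tokens, 2)   # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
trigrams = ngrams(tokens, 3)  # [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```

A text of length L yields L - n + 1 n-grams, so higher orders produce slightly fewer, longer sequences.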
Why It Matters
N-grams capture local context and phrase coherence that single-word (unigram) models miss. The phrase 'New York' has a very different meaning from 'new' and 'york' taken separately; bigram indexing captures this collocation. For search engines, n-gram indexing enables prefix matching and partial-match retrieval. For chatbot intent classification with limited training data, n-gram features often outperform word unigrams by capturing common multi-word expressions. Character n-grams power language detection systems that identify which language a text is written in.
How It Works
N-gram language models use the Markov assumption: the probability of the next word depends only on the previous n-1 words. For bigrams: P(word|context) = count(context, word) / count(context). Kneser-Ney smoothing handles unseen n-grams by redistributing probability mass to lower-order n-grams. For text similarity, shingling converts documents to sets of k-grams and applies MinHash for efficient approximate Jaccard similarity computation. Neural language models implicitly learn n-gram statistics through their attention mechanisms.
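The bigram formula above can be implemented directly from counts; this is a minimal sketch with an illustrative `bigram_mle` function (no smoothing, so it has the zero-probability problem discussed under Common Mistakes):

```python
from collections import Counter

def bigram_mle(tokens):
    """Estimate P(word | prev) = count(prev, word) / count(prev)."""
    context = Counter(tokens[:-1])               # counts of each context word
    pairs = Counter(zip(tokens, tokens[1:]))     # counts of adjacent pairs
    return {pair: c / context[pair[0]] for pair, c in pairs.items()}

corpus = "the cat sat on the mat the cat ran".split()
probs = bigram_mle(corpus)
print(probs[("the", "cat")])  # → 2/3: 'the' appears 3 times as context, followed by 'cat' twice
```

Note that the probabilities conditioned on any given context sum to 1, as a proper conditional distribution must.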
[Interactive widget: N-Gram Sliding Window — Token Sequences & Counts, illustrating unigrams (n=1), bigrams (n=2), and trigrams (n=3)]
Real-World Example
A typo-tolerant search system for a product catalog uses character trigrams to handle misspellings. The product name 'bluetooth speaker' is indexed as character trigrams: {blu, lue, uet, eto, too, oot, oth, ..., spe, pea, eak, ake, ker}. When a user searches 'blutooth speaker,' the trigram overlap of 'blutooth' with 'bluetooth' is high enough to retrieve the correct product, despite the typo. Trigram-based fuzzy matching reduced zero-result searches from 12% to 3% for product queries.
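The trigram-overlap matching described above can be sketched with a Jaccard similarity over character trigram sets (a simplified stand-in for a production fuzzy-matching index; function names are ours):

```python
def char_trigrams(word):
    """Set of overlapping 3-character substrings of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two sets."""
    return len(a & b) / len(a | b)

# The typo 'blutooth' still shares the trigrams blu, too, oot, oth with 'bluetooth'.
score = jaccard(char_trigrams("bluetooth"), char_trigrams("blutooth"))
print(round(score, 2))  # → 0.44, high enough to clear a typical fuzzy-match threshold
```

A production system would precompute trigram sets for every indexed name and retrieve candidates whose similarity exceeds a tuned threshold.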
Common Mistakes
- ✕ Using word n-grams for very long phrases—combinatorial explosion makes models of order 5 and above impractical to index
- ✕ Skipping smoothing in n-gram language models—without smoothing, a single unseen n-gram receives zero probability and zeroes out the probability of the entire sequence
- ✕ Confusing character n-grams with word n-grams—they serve different purposes and produce very different index sizes
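The second mistake above is easy to see in code. This sketch uses add-one (Laplace) smoothing rather than Kneser-Ney for brevity; all names are illustrative:

```python
from collections import Counter

def laplace_bigram_prob(tokens, prev, word, vocab_size):
    """Add-one smoothed P(word | prev): unseen pairs get a small nonzero mass."""
    pairs = Counter(zip(tokens, tokens[1:]))
    context = Counter(tokens[:-1])
    return (pairs[(prev, word)] + 1) / (context[prev] + vocab_size)

corpus = "the cat sat on the mat".split()
V = len(set(corpus))  # vocabulary size, here 5
print(laplace_bigram_prob(corpus, "the", "dog", V))  # → 1/7 ≈ 0.143, nonzero despite unseen bigram
```

Unsmoothed MLE would assign P(dog | the) = 0, which would collapse the probability of any sentence containing that bigram to zero.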
Related Terms
Bag of Words
Bag of words is a text representation model that describes documents by their word frequencies, ignoring grammar and word order, producing fixed-length vectors suitable for classical machine learning algorithms.
Stemming
Stemming reduces words to their root form by stripping suffixes—converting 'running,' 'runs,' and 'ran' to 'run'—enabling search and retrieval systems to match documents regardless of word inflection.
Text Preprocessing
Text preprocessing is the collection of transformations applied to raw text before NLP model training or inference—including tokenization, normalization, and filtering—determining the quality and consistency of model inputs.
Subword Segmentation
Subword segmentation splits words into meaningful sub-units—like 'unbelievable' into 'un', '##believ', '##able'—balancing vocabulary coverage with manageability so NLP models handle rare and unseen words without an explicit unknown token.
Language Detection
Language detection automatically identifies which human language a text is written in—enabling multilingual systems to route inputs to the correct processing pipeline, translation service, or localized response.