N-gram
Definition
An n-gram is a sequence of n consecutive tokens from a text stream. Unigrams are single tokens (n=1); bigrams are pairs (n=2); trigrams are triples (n=3). N-gram language models estimate the probability of each token given its preceding n-1 tokens, using maximum likelihood counts from large corpora, typically combined with a smoothing method such as Kneser-Ney. Character n-grams are used for spell checking, language detection, and subword representations. N-gram models were the dominant approach to language modeling before neural networks and still power search engines, spell checkers, and text fingerprinting systems.
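The sliding-window extraction behind all three orders can be sketched in a few lines of Python (the `ngrams` helper is illustrative, not from any particular library):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list, yielding tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
unigrams = ngrams(tokens, 1)  # [('the',), ('quick',), ('brown',), ('fox',)]
bigrams = ngrams(tokens, 2)   # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
trigrams = ngrams(tokens, 3)  # [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```

A text of length L yields L - n + 1 n-grams, so higher orders produce slightly fewer, longer sequences.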
Why It Matters
N-grams capture local context and phrase coherence that single-word (unigram) models miss. The phrase 'New York' has a very different meaning from 'new' and 'york' taken separately; bigram indexing captures this collocation. For search engines, n-gram indexing enables prefix matching and partial-match retrieval. For chatbot intent classification with limited training data, n-gram features often outperform word unigrams by capturing common multi-word expressions. Character n-grams power language detection systems that identify which language a text is written in.
How It Works
N-gram language models use the Markov assumption: the probability of the next word depends only on the previous n-1 words. For bigrams: P(word|context) = count(context, word) / count(context). Kneser-Ney smoothing handles unseen n-grams by redistributing probability mass to lower-order n-grams. For text similarity, shingling converts documents to sets of k-grams and applies MinHash for efficient approximate Jaccard similarity computation. Neural language models implicitly learn n-gram statistics through their attention mechanisms.
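The bigram formula above can be implemented directly from counts; this is a minimal sketch with an illustrative `bigram_mle` function (no smoothing, so it has the zero-probability problem discussed under Common Mistakes):

```python
from collections import Counter

def bigram_mle(tokens):
    """Estimate P(word | prev) = count(prev, word) / count(prev)."""
    context = Counter(tokens[:-1])               # counts of each context word
    pairs = Counter(zip(tokens, tokens[1:]))     # counts of adjacent pairs
    return {pair: c / context[pair[0]] for pair, c in pairs.items()}

corpus = "the cat sat on the mat the cat ran".split()
probs = bigram_mle(corpus)
print(probs[("the", "cat")])  # → 2/3: 'the' appears 3 times as context, followed by 'cat' twice
```

Note that the probabilities conditioned on any given context sum to 1, as a proper conditional distribution must.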
[Interactive widget: N-Gram Sliding Window — Token Sequences & Counts, illustrating unigrams (n=1), bigrams (n=2), and trigrams (n=3)]
Real-World Example
A typo-tolerant search system for a product catalog uses character trigrams to handle misspellings. The product name 'bluetooth speaker' is indexed as character trigrams: {blu, lue, uet, eto, too, oot, oth, ..., spe, pea, eak, ake, ker}. When a user searches 'blutooth speaker,' the trigram overlap of 'blutooth' with 'bluetooth' is high enough to retrieve the correct product, despite the typo. Trigram-based fuzzy matching reduced zero-result searches from 12% to 3% for product queries.
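The trigram-overlap matching described above can be sketched with a Jaccard similarity over character trigram sets (a simplified stand-in for a production fuzzy-matching index; function names are ours):

```python
def char_trigrams(word):
    """Set of overlapping 3-character substrings of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two sets."""
    return len(a & b) / len(a | b)

# The typo 'blutooth' still shares the trigrams blu, too, oot, oth with 'bluetooth'.
score = jaccard(char_trigrams("bluetooth"), char_trigrams("blutooth"))
print(round(score, 2))  # → 0.44, high enough to clear a typical fuzzy-match threshold
```

A production system would precompute trigram sets for every indexed name and retrieve candidates whose similarity exceeds a tuned threshold.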
Common Mistakes
- ✕ Using word n-grams for very long phrases—combinatorial explosion makes models of order 5 and above impractical to index
- ✕ Skipping smoothing in n-gram language models—without smoothing, a single unseen n-gram receives zero probability and zeroes out the probability of the entire sequence
- ✕ Confusing character n-grams with word n-grams—they serve different purposes and produce very different index sizes
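The second mistake above is easy to see in code. This sketch uses add-one (Laplace) smoothing rather than Kneser-Ney for brevity; all names are illustrative:

```python
from collections import Counter

def laplace_bigram_prob(tokens, prev, word, vocab_size):
    """Add-one smoothed P(word | prev): unseen pairs get a small nonzero mass."""
    pairs = Counter(zip(tokens, tokens[1:]))
    context = Counter(tokens[:-1])
    return (pairs[(prev, word)] + 1) / (context[prev] + vocab_size)

corpus = "the cat sat on the mat".split()
V = len(set(corpus))  # vocabulary size, here 5
print(laplace_bigram_prob(corpus, "the", "dog", V))  # → 1/7 ≈ 0.143, nonzero despite unseen bigram
```

Unsmoothed MLE would assign P(dog | the) = 0, which would collapse the probability of any sentence containing that bigram to zero.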
Related Terms
Bag of Words
Bag of words is a text representation model that describes documents by their word frequencies, ignoring grammar and word order, producing fixed-length vectors suitable for classical machine learning algorithms.
Stemming
Stemming reduces words to their root form by stripping suffixes—converting 'running,' 'runs,' and 'ran' to 'run'—enabling search and retrieval systems to match documents regardless of word inflection.
Text Preprocessing
Text preprocessing is the collection of transformations applied to raw text before NLP model training or inference—including tokenization, normalization, and filtering—determining the quality and consistency of model inputs.
Subword Segmentation
Subword segmentation splits words into meaningful sub-units—like 'unbelievable' into 'un', '##believ', '##able'—balancing vocabulary coverage with manageability so NLP models handle rare and unseen words without an explicit unknown token.
Language Detection
Language detection automatically identifies which human language a text is written in—enabling multilingual systems to route inputs to the correct processing pipeline, translation service, or localized response.