Natural Language Processing (NLP)

Out-of-Vocabulary

Definition

Out-of-vocabulary (OOV) tokens are words or subwords that a model's tokenizer cannot represent as single learned units, forcing a fallback to an unknown-token placeholder, character-level decomposition, or subword fragmentation. Classical word-level models had a strict OOV problem: any unseen word became a single [UNK] token with no learned representation. Modern subword models (BPE, WordPiece, SentencePiece) largely eliminate OOV by decomposing unknown words into known subword pieces, and character-level fallback means any Unicode character sequence can be represented. However, severe fragmentation of an OOV word into many subword pieces still degrades model performance on that word.

Why It Matters

OOV handling determines how well NLP systems generalize to real-world inputs containing domain jargon, neologisms, spelling variations, and proper nouns. A customer support bot trained before a product rebrand will encounter the new product name as OOV, potentially misclassifying issues related to that product. Medical chatbots receive drug names and medical terminology not in general-purpose vocabularies. Understanding OOV handling helps practitioners choose appropriate models, extend vocabularies for domain adaptation, and interpret unexpected model failures on specific inputs.

How It Works

Subword models address OOV through greedy longest-match fallback: the tokenizer first tries to match the full word as a vocabulary token; if that fails, it matches the longest known prefix and repeats on the remainder, breaking down to character-level pieces if necessary. WordPiece marks continuation subwords with '##' (e.g., 'unprecedented' → 'un', '##pre', '##ced', '##ented'). SentencePiece marks word starts with '▁' (a special underscore character). Character-level fallback in SentencePiece ensures any Unicode string can be encoded. FastText handles OOV differently: it sums character n-gram embeddings for an unseen word, producing reasonable semantic representations even for completely novel words.
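The greedy longest-match fallback described above can be sketched in a few lines of Python. The tiny vocabulary here is purely illustrative, not a real model's:

```python
# Minimal sketch of WordPiece-style greedy longest-match tokenization
# with [UNK] fallback. The vocabulary below is illustrative only.
VOCAB = {"the", "cat", "ran", "un", "grok", "Chat",
         "##ked", "##G", "##PT", "##pre", "##ced", "##ented"}

def wordpiece_tokenize(word, vocab=VOCAB, unk="[UNK]"):
    """Greedily match the longest vocab piece; later pieces get a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no subword matches at all: whole word becomes [UNK]
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece_tokenize("grokked"))  # ['grok', '##ked']
print(wordpiece_tokenize("xyzzy"))    # ['[UNK]']
```

Real tokenizers add details (maximum word length, byte-level fallback, special tokens), but the core loop is this longest-prefix match.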

Out-of-Vocabulary: Known Vocab vs. OOV Handling

Known Vocabulary (sample)

the, cat, sat, on, mat, dog, ran, fast, slow, big, … +50k more

Lookup Results

  • cat (IN-VOCAB): direct lookup → cat_id: 42
  • grokked (OOV): subword split → grok + ##ked
  • ChatGPT (OOV): subword split → Chat + ##G + ##PT
  • ran (IN-VOCAB): direct lookup → ran_id: 17
  • COVID-19 (OOV): [UNK] token
OOV Handling Strategies

  • Subword (BPE): split into sub-units; most common in LLMs
  • [UNK] Token: replace with an unknown placeholder
  • Character-level: fall back to character embeddings
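The character-level strategy relates to FastText's approach mentioned above: an unseen word's vector is built from its character n-grams. In this sketch the hash-derived vectors are hypothetical stand-ins for learned n-gram embeddings, which a trained model would supply:

```python
# Sketch of FastText-style OOV handling: represent an unseen word as the
# average of its character n-gram vectors. The hash-derived vectors below
# are stand-ins for learned embeddings.
import hashlib

DIM = 8

def ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def ngram_vector(gram):
    """Deterministic pseudo-embedding derived from a hash (illustrative only)."""
    digest = hashlib.md5(gram.encode()).digest()
    return [b / 255.0 for b in digest[:DIM]]

def oov_vector(word):
    """Average the n-gram vectors to get a representation for any word."""
    vecs = [ngram_vector(g) for g in ngrams(word)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

Because every word, seen or unseen, decomposes into character n-grams, this scheme never produces an [UNK]; morphologically similar words share n-grams and therefore end up with similar vectors.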

Real-World Example

A retail chatbot deployed in December handles seasonal product names like 'SantaBot Pro 2026' as OOV tokens because they didn't exist during training. The product name fragments into ['Santa', '##Bot', 'Pro', '20', '##26']—5 subword tokens that the model treats incoherently. The team adds domain-specific vocabulary terms to the tokenizer vocabulary and fine-tunes the model on new product content before each major product launch, maintaining coherent product name representations and preventing OOV-induced classification errors on support queries.
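The vocabulary-extension step in this example can be sketched generically. The `embeddings` table, `add_token` helper, and dimension are hypothetical stand-ins for a real model's embedding layer and tokenizer API:

```python
# Sketch of vocabulary extension: a newly added token starts with a random
# embedding and carries no semantic signal until the model is fine-tuned on
# text containing it. Names and values here are illustrative.
import random

random.seed(0)
DIM = 4
# Pretrained embeddings (illustrative values).
embeddings = {"santa": [0.1, 0.2, 0.3, 0.4]}

def add_token(token):
    """Append a new token with a small random embedding; training must follow."""
    embeddings[token] = [random.uniform(-0.1, 0.1) for _ in range(DIM)]

add_token("SantaBot")  # meaningless until fine-tuned on product content
```

This is why vocabulary extension and fine-tuning go together in the example: the new token only becomes useful once its embedding is trained.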

Common Mistakes

  • Assuming subword models have no OOV issues—severe fragmentation of technical or domain-specific terms degrades those terms' representations
  • Not monitoring OOV rates in production—high OOV rates on specific query types signal vocabulary gaps requiring domain adaptation
  • Extending vocabularies without fine-tuning—new vocabulary tokens have random initial embeddings and provide no semantic signal until trained
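The monitoring point above can be made concrete with a small sketch. Here `fragmentation_rate` and the `max_pieces` threshold are illustrative choices, and `tokenize` stands in for whatever tokenizer the deployed model uses:

```python
# Sketch of production OOV monitoring: flag words that map to [UNK] or
# fragment into many subword pieces. A rising rate on a query category
# signals a vocabulary gap.
def fragmentation_rate(words, tokenize, max_pieces=3):
    """Fraction of words that hit [UNK] or split into more than max_pieces."""
    if not words:
        return 0.0
    bad = 0
    for word in words:
        pieces = tokenize(word)
        if pieces == ["[UNK]"] or len(pieces) > max_pieces:
            bad += 1
    return bad / len(words)
```

Tracking this rate per query category (billing, product names, error codes) makes vocabulary gaps visible before they show up as classification errors.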
