Subword Segmentation
Definition
Subword segmentation algorithms partition text into units smaller than full words but larger than characters, creating vocabularies that efficiently cover diverse text without a combinatorial explosion of word types. Byte Pair Encoding (BPE), used by GPT models, starts with characters and iteratively merges the most frequent adjacent pairs. WordPiece, used by BERT, chooses merges by how much they improve language model likelihood rather than by raw frequency. SentencePiece is an implementation (supporting both BPE and a unigram language-model algorithm) that works directly on raw Unicode text without pre-tokenization. All three produce fixed-size vocabularies (typically 16,000 to 100,000 tokens) that represent common words as single tokens and decompose rare words into recognizable sub-units.
Why It Matters
Subword segmentation solved the fundamental vocabulary problem in neural NLP: word-level models couldn't represent unseen words, while character-level models required much longer sequences and made long-range patterns harder to learn. Subword representations enable models to learn morphological structure (the prefix 'un-' consistently signals negation across many words) while keeping sequence lengths manageable. For multilingual models, subword segmentation with a shared vocabulary enables cross-lingual parameter sharing. Understanding tokenization is critical because all modern NLP models—GPT, BERT, T5, LLaMA—are fundamentally subword-level models.
How It Works
BPE construction: (1) initialize the vocabulary with all characters; (2) count all adjacent symbol pairs in the corpus; (3) merge the most frequent pair into a new symbol; (4) repeat until the target vocabulary size is reached. The learned merge history defines the tokenizer applied at inference. WordPiece differs by merging the pair that most increases training-data likelihood (in practice, scoring each pair by its frequency divided by the product of its parts' frequencies) rather than the pair with the highest raw frequency. At inference, a word is segmented greedily using longest-match-first: try to match the whole word as a single token; if it is not in the vocabulary, take the longest prefix that is, then repeat on the remainder. SentencePiece applies the same algorithms to raw text, treating whitespace as part of the token stream and marking word boundaries with a special '▁' (U+2581) symbol.
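The four training steps above can be sketched in a few lines of Python. This is a toy illustration on a tiny word-frequency corpus; real implementations pre-tokenize text, handle byte-level fallback, and run for tens of thousands of merges:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (step 1: start from characters)."""
    word_freqs = Counter(corpus)
    # Represent each word as a tuple of symbols, initially single characters.
    words = {w: tuple(w) for w in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for w, freq in word_freqs.items():
            syms = words[w]
            for pair in zip(syms, syms[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        for w, syms in words.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            words[w] = tuple(out)
    return merges  # step 4: the ordered merge history is the trained tokenizer

corpus = ["low"] * 5 + ["lower"] * 2 + ["lowest"] * 3
merges = train_bpe(corpus, 3)
# Frequent character pairs merge first: ('l','o'), then ('lo','w'), then ('low','e')
```

Note that the ordered list of merges, not a static vocabulary, is what gets replayed on new text at inference time.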
Real-World Example
A multilingual e-commerce platform builds a unified product search system. Their SentencePiece tokenizer trained on 20 languages creates shared subword representations that work across English, German (compound-heavy: 'Staubsaugerbeutel' → 'Staub', 'sauger', 'beutel'), Japanese (character-level fallback), and Arabic (right-to-left script with complex morphology). The shared vocabulary enables a single embedding layer and model to handle all 20 languages without language-specific components, reducing infrastructure complexity by 75%.
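The compound decomposition above can be reproduced with a toy longest-match segmenter. The three-entry vocabulary is hypothetical and the input is lowercased for simplicity; a real SentencePiece model would be trained on a large multilingual corpus:

```python
def greedy_segment(word, vocab):
    """WordPiece-style longest-match-first subword segmentation."""
    tokens, start = [], 0
    while start < len(word):
        # Find the longest vocabulary entry that prefixes the remainder.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["<unk>"]  # no match at all: fall back to an unknown token
        tokens.append(word[start:end])
        start = end
    return tokens

vocab = {"staub", "sauger", "beutel"}  # hypothetical subword vocabulary
pieces = greedy_segment("staubsaugerbeutel", vocab)
# pieces == ['staub', 'sauger', 'beutel']
```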
Common Mistakes
- ✕ Using a tokenizer trained on one domain/language for a very different domain/language—vocabulary fragmentation increases and performance drops
- ✕ Ignoring tokenizer-model pairing—BERT and GPT-4 use different tokenizers; mixing them produces incorrect inputs
- ✕ Counting tokens incorrectly in API cost estimation—subword tokenization means character counts are unreliable predictors of token counts
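The last point can be made concrete with a toy vocabulary: two strings of identical character length can tokenize to very different token counts. The segmenter and vocabulary below are illustrative, not any provider's actual tokenizer:

```python
import string

def greedy_segment(word, vocab):
    """Illustrative longest-match-first subword segmentation."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["<unk>"]
        tokens.append(word[start:end])
        start = end
    return tokens

# Toy vocabulary: two common subwords plus single-character fallback.
vocab = {"token", "ization"} | set(string.ascii_lowercase)

common = greedy_segment("tokenization", vocab)  # 2 tokens
rare = greedy_segment("xylophonists", vocab)    # 12 single-character tokens
# Both inputs are 12 characters long, yet token counts differ sixfold.
```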
Related Terms
Vocabulary Size
Vocabulary size is the number of unique tokens a language model or NLP system recognizes, determining the trade-off between model expressiveness, memory requirements, and the handling of unseen words.
Out-of-Vocabulary
Out-of-vocabulary (OOV) refers to words or tokens that appear at inference time but were absent from the model's training vocabulary, causing the model to fail to represent them properly and degrading prediction accuracy.
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model pre-trained on massive text corpora that revolutionized NLP by providing rich contextual word representations that dramatically improved nearly every language task.
Natural Language Processing (NLP)
Natural Language Processing (NLP) is the field of AI focused on enabling computers to understand, interpret, and generate human language—powering applications from chatbots and search engines to translation and sentiment analysis.
Text Preprocessing
Text preprocessing is the collection of transformations applied to raw text before NLP model training or inference—including tokenization, normalization, and filtering—determining the quality and consistency of model inputs.