Natural Language Processing (NLP)

Subword Segmentation

Definition

Subword segmentation algorithms partition text into units smaller than full words but larger than characters, creating vocabularies that efficiently cover diverse text without a combinatorial explosion of word types. Byte Pair Encoding (BPE), used by the GPT family, starts with characters and iteratively merges the most frequent adjacent pairs. WordPiece, used by BERT, chooses merges that maximize language model likelihood rather than raw frequency. SentencePiece is a library that implements subword tokenization (BPE and the unigram language model) directly on raw Unicode text, with no pre-tokenization step. All three produce fixed-size vocabularies (typically 16,000–100,000 tokens) that represent common words as single tokens and decompose rare words into recognizable sub-units.

Why It Matters

Subword segmentation solved the fundamental vocabulary problem in neural NLP: word-level models could not represent unseen words, while character-level models required much longer sequences and made long-range patterns harder to learn. Subword representations let models learn morphological structure (the prefix 'un-' consistently signals negation across many words) while keeping sequence lengths manageable. For multilingual models, subword segmentation with a shared vocabulary enables cross-lingual parameter sharing. Understanding tokenization is critical because all modern NLP models—GPT, BERT, T5, LLaMA—are fundamentally subword-level models.

How It Works

BPE construction: (1) initialize the vocabulary with all characters; (2) count all adjacent symbol pairs in the corpus; (3) merge the most frequent pair into a new symbol; (4) repeat until the target vocabulary size is reached. The recorded merge history defines the tokenizer applied at inference. WordPiece differs by merging the pair that maximizes language model likelihood rather than raw frequency. At inference, a word is segmented greedily with longest-match-first: try to match the whole word as a single token; if it is not in the vocabulary, take the longest prefix that is, then repeat on the remainder. SentencePiece applies the same algorithms to raw text, using the special '▁' (U+2581) marker to encode word boundaries.
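The four-step BPE construction above can be sketched as a toy trainer. This is a minimal illustration, not a production implementation; the small corpus and merge count are invented for the example.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: each word is a tuple of symbols, starting as characters."""
    words = Counter(tuple(w) for w in corpus)  # step 1: character-level vocabulary
    merges = []
    for _ in range(num_merges):
        # Step 2: count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged  # step 4: repeat on the updated corpus
    return merges

corpus = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
merges = train_bpe(corpus, 4)
print(merges[0])  # most frequent pair in this toy corpus: ('w', 'e')
```

The returned merge list is exactly the "merge history" the section describes: replaying it in order on new text reproduces the tokenizer.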

Subword Segmentation — BPE / WordPiece / SentencePiece

  Word              BPE             WordPiece (##)          SentencePiece (▁)
  "unhappiness"     unhappiness     un ##hap ##pi ##ness    ▁unhappiness
  "tokenization"    tokenization    token ##ization         ▁tokenization
  "chatbot"         chatbot         chat ##bot              ▁chatbot

Key benefit: rare words are split into known subwords, eliminating out-of-vocabulary tokens while keeping vocabulary size manageable.
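The WordPiece-style segmentations shown above can be reproduced with the greedy longest-match-first procedure described earlier. A minimal sketch, with a hand-picked toy vocabulary (not a real BERT vocabulary):

```python
def wordpiece_segment(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation (WordPiece-style inference)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Find the longest substring starting at `start` that is in the vocabulary.
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no valid segmentation exists
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##hap", "##pi", "##ness", "token", "##ization", "chat", "##bot"}
print(wordpiece_segment("unhappiness", vocab))   # ['un', '##hap', '##pi', '##ness']
print(wordpiece_segment("tokenization", vocab))  # ['token', '##ization']
```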

Real-World Example

A multilingual e-commerce platform builds a unified product search system. Their SentencePiece tokenizer trained on 20 languages creates shared subword representations that work across English, German (compound-heavy: 'Staubsaugerbeutel' → 'Staub', 'sauger', 'beutel'), Japanese (character-level fallback), and Arabic (right-to-left script with complex morphology). The shared vocabulary enables a single embedding layer and model to handle all 20 languages without language-specific components, reducing infrastructure complexity by 75%.

Common Mistakes

  • Using a tokenizer trained on one domain/language for a very different domain/language—vocabulary fragmentation increases and performance drops
  • Ignoring tokenizer-model pairing—BERT and GPT-4 use different tokenizers; mixing them produces incorrect inputs
  • Counting tokens incorrectly in API cost estimation—subword tokenization means character counts are unreliable predictors of token counts
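The last mistake is easy to see concretely: token count depends on the learned merges, not on character count. A toy sketch with an invented merge list (real tokenizers such as GPT's apply thousands of merges, but the mechanics are the same):

```python
def bpe_encode(word, merges):
    """Apply learned BPE merges in order to segment a word at inference time."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # replace the pair with the merged symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merges learned from a corpus rich in '-est' words.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowest", merges))  # ['low', 'est'] — 6 characters, but only 2 tokens
```

A six-character word becomes two tokens here, while a rare or out-of-domain word of the same length could become six: character-based cost estimates can be off by a large factor in either direction.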
