Natural Language Processing (NLP)

Vocabulary Size

Definition

Vocabulary size (V) in NLP refers to the total number of distinct tokens in a model's lexicon. In classical NLP, the vocabulary contains the word types found in the training corpus; in modern subword-based systems, it contains subword units. Larger vocabularies reduce the frequency of out-of-vocabulary (OOV) tokens but increase the size of the embedding matrix (proportional to V × embedding_dim). BERT uses a 30,000-token WordPiece vocabulary; GPT-4 uses roughly 100,000 BPE tokens. Vocabulary size is a key hyperparameter that balances expressiveness (larger vocabularies represent rarer words directly) against efficiency (smaller vocabularies are faster and more memory-efficient).
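To make the memory cost concrete, here is a minimal sketch computing the embedding-matrix parameter count for several vocabulary sizes, assuming a 768-dimensional embedding (the width used by BERT-base; other models differ):

```python
# Embedding matrix parameters grow linearly with vocabulary size: V x embedding_dim.
def embedding_params(vocab_size: int, embedding_dim: int = 768) -> int:
    """Number of parameters in the token embedding matrix alone."""
    return vocab_size * embedding_dim

for v in (5_000, 30_000, 100_000):
    print(f"V={v:>7,}: {embedding_params(v):>11,} embedding parameters")
# V=  5,000:   3,840,000 embedding parameters
# V= 30,000:  23,040,000 embedding parameters
# V=100,000:  76,800,000 embedding parameters
```

At a 100,000-token vocabulary, the embedding table alone costs roughly 77M parameters at this width, which is why vocabulary size matters most for smaller models where embeddings are a large fraction of the total.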

Why It Matters

Vocabulary size affects model capability in subtle but important ways. Models with small vocabularies fragment rare words into many subword pieces, increasing sequence length and reducing semantic coherence of rare-word representations. Models with excessively large vocabularies waste parameters on rarely-seen tokens and require more training data to learn good representations for each token. For multilingual models, vocabulary must cover all target languages, typically requiring vocabularies of 100,000+ tokens to prevent excessive fragmentation of non-Latin scripts.
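The fragmentation effect can be illustrated with a toy greedy longest-match tokenizer (WordPiece-style) over two hypothetical vocabularies; the vocabularies and the word below are assumptions for illustration, not any real model's tokenizer:

```python
def greedy_tokenize(word: str, vocab: set) -> list:
    """Toy WordPiece-style tokenizer: repeatedly take the longest prefix in vocab."""
    tokens, i = [], 0
    while i < len(word):
        j = len(word)
        while j > i and word[i:j] not in vocab:
            j -= 1
        if j == i:
            # No match at all: fall back to a single character.
            tokens.append(word[i])
            i += 1
        else:
            tokens.append(word[i:j])
            i = j
    return tokens

small_vocab = set("abcdefghijklmnopqrstuvwxyz")        # characters only
large_vocab = small_vocab | {"acetyl", "salicylic"}    # adds two domain subwords

print(greedy_tokenize("acetylsalicylic", small_vocab))  # 15 single-character tokens
print(greedy_tokenize("acetylsalicylic", large_vocab))  # ['acetyl', 'salicylic']
```

With the character-only vocabulary the word shatters into 15 pieces; adding just two domain subwords collapses it to 2, shortening the sequence and keeping the term's representation coherent.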

How It Works

Vocabulary construction for subword models typically uses Byte Pair Encoding (BPE) or a related algorithm (e.g., WordPiece, or the unigram model implemented in SentencePiece): starting from character-level tokens, BPE iteratively merges the most frequent adjacent token pair until the target vocabulary size is reached. The ordered merge history defines the tokenizer. Larger corpora and more merges produce larger vocabularies that represent more complete words. For multilingual models, the vocabulary is built on a balanced multilingual corpus so that high-resource languages do not dominate the vocabulary at the expense of low-resource language coverage.
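The merge loop described above can be sketched in a few dozen lines. This is a toy BPE trainer on a tiny made-up corpus, not a production tokenizer (real implementations add pre-tokenization, end-of-word markers, and byte-level fallback):

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    """Toy BPE: start from characters, repeatedly merge the most frequent
    adjacent symbol pair. Returns the ordered merge list (the tokenizer)."""
    # Represent each word as a tuple of symbols, initially single characters.
    word_freqs = Counter(tuple(w) for w in corpus_words)

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_freqs = Counter()
        for symbols, freq in word_freqs.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_freqs[(a, b)] += freq
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)

        # Apply the winning merge to every word.
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_freqs[tuple(merged)] += freq
        word_freqs = new_freqs
    return merges

merges = bpe_train(["low", "low", "lower", "lowest"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Each merge adds one token to the vocabulary, so `num_merges` directly controls the final vocabulary size on top of the base character set.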

Vocabulary Size — Trade-off Analysis

Size               OOV rate   Memory use   Inference speed
Small (5,000)      90%        10%          95%
Medium (30,000)    25%        45%          70%
Large (100,000)    5%         90%          35%

(Values are illustrative and relative, not measurements of any specific model.)

Real model vocabulary sizes

Model      Vocabulary size
GPT-2      50,257
BERT       30,522
GPT-4      ~100,000
LLaMA 3    128,000
Sweet spot ~30K–100K: Subword tokenization (BPE/WordPiece) balances OOV rate, embedding memory, and inference speed.

Real-World Example

A company builds a domain-specific chatbot for medical records and finds that their 30,000-token general vocabulary fragments medical terms poorly: 'acetylsalicylic' becomes 4 subword tokens, losing cohesion. They rebuild the vocabulary at 50,000 tokens trained on a medical corpus, achieving single-token representations for the 2,000 most common medical terms. NER accuracy on medical entities improves by 6 percentage points because the model can now learn robust representations for medical terminology without fragmentation artifacts.

Common Mistakes

  • Setting vocabulary size without considering the target domain—general vocabularies perform poorly on specialized technical content
  • Making vocabulary too small for multilingual use—insufficient vocabulary forces excessive fragmentation of non-English text
  • Treating vocabulary size as fixed after deployment—as language evolves, new terms not in the vocabulary become OOV and require periodic updates
