Natural Language Processing (NLP)

Vocabulary Size

Definition

Vocabulary size (V) in NLP refers to the total number of distinct tokens in a model's lexicon. In classical NLP, the vocabulary contains the word types found in the training corpus; in modern subword-based systems, it contains subword units. Larger vocabularies reduce the frequency of out-of-vocabulary (OOV) tokens but increase the size of the embedding matrix (proportional to V × embedding_dim). BERT uses a 30,000-token WordPiece vocabulary; GPT-4 uses roughly 100,000 BPE tokens. Vocabulary size is a key hyperparameter that balances expressiveness (larger vocabularies represent rarer words directly) against efficiency (smaller vocabularies are faster and more memory-efficient).
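To make the memory cost concrete, here is a minimal sketch computing the embedding-matrix parameter count for several vocabulary sizes, assuming a 768-dimensional embedding (the width used by BERT-base; other models differ):

```python
# Embedding matrix parameters grow linearly with vocabulary size: V x embedding_dim.
def embedding_params(vocab_size: int, embedding_dim: int = 768) -> int:
    """Number of parameters in the token embedding matrix alone."""
    return vocab_size * embedding_dim

for v in (5_000, 30_000, 100_000):
    print(f"V={v:>7,}: {embedding_params(v):>11,} embedding parameters")
# V=  5,000:   3,840,000 embedding parameters
# V= 30,000:  23,040,000 embedding parameters
# V=100,000:  76,800,000 embedding parameters
```

At a 100,000-token vocabulary, the embedding table alone costs roughly 77M parameters at this width, which is why vocabulary size matters most for smaller models where embeddings are a large fraction of the total.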

Why It Matters

Vocabulary size affects model capability in subtle but important ways. Models with small vocabularies fragment rare words into many subword pieces, increasing sequence length and reducing semantic coherence of rare-word representations. Models with excessively large vocabularies waste parameters on rarely-seen tokens and require more training data to learn good representations for each token. For multilingual models, vocabulary must cover all target languages, typically requiring vocabularies of 100,000+ tokens to prevent excessive fragmentation of non-Latin scripts.
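The fragmentation effect can be illustrated with a toy greedy longest-match tokenizer (WordPiece-style) over two hypothetical vocabularies; the vocabularies and the word below are assumptions for illustration, not any real model's tokenizer:

```python
def greedy_tokenize(word: str, vocab: set) -> list:
    """Toy WordPiece-style tokenizer: repeatedly take the longest prefix in vocab."""
    tokens, i = [], 0
    while i < len(word):
        j = len(word)
        while j > i and word[i:j] not in vocab:
            j -= 1
        if j == i:
            # No match at all: fall back to a single character.
            tokens.append(word[i])
            i += 1
        else:
            tokens.append(word[i:j])
            i = j
    return tokens

small_vocab = set("abcdefghijklmnopqrstuvwxyz")        # characters only
large_vocab = small_vocab | {"acetyl", "salicylic"}    # adds two domain subwords

print(greedy_tokenize("acetylsalicylic", small_vocab))  # 15 single-character tokens
print(greedy_tokenize("acetylsalicylic", large_vocab))  # ['acetyl', 'salicylic']
```

With the character-only vocabulary the word shatters into 15 pieces; adding just two domain subwords collapses it to 2, shortening the sequence and keeping the term's representation coherent.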

How It Works

Vocabulary construction for subword models typically uses Byte Pair Encoding (BPE) or a related algorithm (e.g., WordPiece, or the unigram model implemented in SentencePiece): starting from character-level tokens, BPE iteratively merges the most frequent adjacent token pair until the target vocabulary size is reached. The ordered merge history defines the tokenizer. Larger corpora and more merges produce larger vocabularies that represent more complete words. For multilingual models, the vocabulary is built on a balanced multilingual corpus so that high-resource languages do not dominate the vocabulary at the expense of low-resource language coverage.
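The merge loop described above can be sketched in a few dozen lines. This is a toy BPE trainer on a tiny made-up corpus, not a production tokenizer (real implementations add pre-tokenization, end-of-word markers, and byte-level fallback):

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    """Toy BPE: start from characters, repeatedly merge the most frequent
    adjacent symbol pair. Returns the ordered merge list (the tokenizer)."""
    # Represent each word as a tuple of symbols, initially single characters.
    word_freqs = Counter(tuple(w) for w in corpus_words)

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_freqs = Counter()
        for symbols, freq in word_freqs.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_freqs[(a, b)] += freq
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)

        # Apply the winning merge to every word.
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_freqs[tuple(merged)] += freq
        word_freqs = new_freqs
    return merges

merges = bpe_train(["low", "low", "lower", "lowest"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Each merge adds one token to the vocabulary, so `num_merges` directly controls the final vocabulary size on top of the base character set.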

Vocabulary Size — Trade-off Analysis

Size               OOV rate   Memory use   Inference speed
Small (5,000)      90%        10%          95%
Medium (30,000)    25%        45%          70%
Large (100,000)    5%         90%          35%

(Values are illustrative and relative, not measurements of any specific model.)

Real model vocabulary sizes

Model      Vocabulary size
GPT-2      50,257
BERT       30,522
GPT-4      ~100,000
LLaMA 3    128,000
Sweet spot ~30K–100K: Subword tokenization (BPE/WordPiece) balances OOV rate, embedding memory, and inference speed.

Real-World Example

A company builds a domain-specific chatbot for medical records and finds that their 30,000-token general vocabulary fragments medical terms poorly: 'acetylsalicylic' becomes 4 subword tokens, losing cohesion. They rebuild the vocabulary at 50,000 tokens trained on a medical corpus, achieving single-token representations for the 2,000 most common medical terms. NER accuracy on medical entities improves by 6 percentage points because the model can now learn robust representations for medical terminology without fragmentation artifacts.

Common Mistakes

  • Setting vocabulary size without considering the target domain—general vocabularies perform poorly on specialized technical content
  • Making vocabulary too small for multilingual use—insufficient vocabulary forces excessive fragmentation of non-English text
  • Treating vocabulary size as fixed after deployment—as language evolves, new terms not in the vocabulary become OOV and require periodic updates
