Byte-Pair Encoding (BPE)
Definition
Byte-Pair Encoding (BPE) is the tokenization algorithm that builds the vocabulary used by GPT, Llama, and most other modern LLMs. Starting from a base vocabulary of individual characters (or bytes), BPE iteratively finds the most frequent pair of adjacent tokens in the training corpus and merges them into a new token, repeating until the vocabulary reaches a target size (typically 32K-128K tokens). The result is a vocabulary in which common words, prefixes, and suffixes are single tokens, while rare words are split into multiple subword tokens. GPT-4's cl100k_base vocabulary has roughly 100K tokens; Llama 3 uses 128K. The merge rules learned during training are applied deterministically at tokenization time.
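The deterministic application of merge rules can be sketched in a few lines of Python. This is a toy illustration: production tokenizers such as tiktoken apply merges by learned priority rank rather than a simple in-order scan, and the merge list below is invented for the example.

```python
def bpe_encode(word, merges):
    """Apply learned merge rules to a word; fixed rules give a fixed result."""
    tokens = list(word)  # start from individual characters
    for a, b in merges:  # merge rules in the order they were learned
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # replace the pair with one new token
            else:
                i += 1
    return tokens

# A made-up merge list for illustration:
merges = [("c", "h"), ("a", "t"), ("ch", "at")]
print(bpe_encode("chatbot", merges))  # ['chat', 'b', 'o', 't']
```

Because the merge list is fixed, re-running this on the same input always yields the same tokens, which is why tokenization is reproducible across API calls.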
Why It Matters
BPE tokenization shapes how LLMs 'see' text in ways that affect both their capabilities and their limitations. Common English words are single tokens (good for efficiency), but rare technical terms, non-English words, and code constructs may be split into multiple tokens (less efficient, and sometimes a source of artifacts in model understanding). BPE tokenization explains why LLMs sometimes struggle with counting characters in words (in some tokenizers 'strawberry' becomes ['str', 'awberry'], so the model never sees the three 'r's as separate units), with consistent handling of rare proper nouns, and with uniform performance across languages (languages underrepresented in the tokenizer's training data get split into many more tokens).
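A toy illustration of the character-counting problem (the subword split and token IDs below are invented for the example, not taken from a real vocabulary):

```python
# Hypothetical subword split and vocabulary IDs for 'strawberry'.
pieces = ["str", "awberry"]
token_ids = [1042, 7781]  # invented IDs for illustration

# The model receives only the ID sequence. The three 'r' characters
# exist in the text, but they are never separate units in the input,
# so 'how many r's?' cannot be answered by inspecting the tokens.
assert "".join(pieces) == "strawberry"
assert "strawberry".count("r") == 3   # true at the character level
assert len(token_ids) == 2            # but the model sees 2 opaque units
```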
How It Works
BPE training process: (1) start with a character-level vocabulary (or byte-level for byte-level BPE); (2) encode the training corpus using the current vocabulary; (3) count the frequency of all adjacent token pairs; (4) merge the most frequent pair into a new token; (5) update the corpus encoding; (6) repeat steps 3-5 until the target vocabulary size is reached. The resulting merge rules are fixed: the same rules applied to any text always produce the same tokenization. The tiktoken library (OpenAI) and SentencePiece (Google) provide production BPE implementations. Different models use different BPE vocabularies, so the same text tokenizes differently from model to model.
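Steps 1-6 above can be sketched as a minimal trainer. This is a pure-Python toy; real implementations add byte-level fallback, pre-tokenization on whitespace, and efficiency optimizations that this omits.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (steps 1-6)."""
    words = [list(w) for w in corpus]        # (1) character-level encoding
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                    # (3) count adjacent token pairs
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # (4) pick the most frequent pair
        merges.append((a, b))
        for w in words:                      # (5) re-encode with the new merge
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges                            # (6) repeat until budget is used

print(train_bpe(["chat", "chat", "chatbot"], 3))
# [('c', 'h'), ('ch', 'a'), ('cha', 't')]
```

On this tiny corpus the frequent word 'chat' is merged into a single token after three rounds, while the rarer suffix 'bot' stays split, mirroring how common words end up as whole tokens in real vocabularies.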
Figure: BPE merge steps for tokenizing the word "chatbot".
Real-World Example
A 99helpers developer notices that Korean queries to their multilingual chatbot cost roughly 3x more than equivalent English queries. Using tiktoken to analyze token counts, they find that 'Hello, how can I help you?' is 8 tokens in English, while its Korean equivalent (안녕하세요, 어떻게 도와드릴까요?) is 23 tokens. The BPE vocabulary was trained primarily on English text, so Korean characters form few multi-character tokens and require more tokens per word. Understanding this, they optimize: switching to a Korean-specific tokenizer for Korean user segments reduces token usage by 40% for those users.
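Since LLM APIs bill per token, the cost gap follows directly from the token counts above (the rate below is a placeholder, not a real price):

```python
def query_cost(n_tokens, usd_per_1k_tokens):
    # Per-token billing: cost scales linearly with token count.
    return n_tokens * usd_per_1k_tokens / 1000

rate = 0.01  # placeholder rate in USD per 1K tokens
english = query_cost(8, rate)   # 'Hello, how can I help you?'
korean = query_cost(23, rate)   # the Korean equivalent
print(korean / english)         # 2.875 -- roughly the observed 3x gap
```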
Common Mistakes
- ✕ Assuming word boundaries align with BPE token boundaries—BPE tokens can span parts of words, whole words, or even multiple words; inspect actual tokenization for your specific content.
- ✕ Using character count as a proxy for token count across different scripts—BPE tokenization efficiency varies dramatically: English ~4 chars/token, Chinese/Japanese ~1-2 chars/token.
- ✕ Expecting consistent word spelling detection from LLMs—because rare words are split into subwords, LLMs may 'see' the components rather than the whole word, causing edge cases in spelling-sensitive tasks.
Related Terms
Tokenization
Tokenization converts raw text into a sequence of tokens—the basic units an LLM processes—using algorithms like byte-pair encoding that split text into subword pieces rather than whole words or individual characters.
Token
A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
Pre-Training
Pre-training is the foundational phase of LLM development where the model learns language understanding and world knowledge by predicting the next token across vast text corpora, before any task-specific optimization.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →