Byte-Pair Encoding (BPE)
Definition
Byte-Pair Encoding (BPE) is the tokenization algorithm that builds the vocabulary used by GPT, Llama, and most other modern LLMs. Starting from a base vocabulary of individual characters (or bytes), BPE iteratively finds the most frequent pair of adjacent tokens in the training corpus and merges them into a new token, repeating until the vocabulary reaches a target size (typically 32K-128K tokens). The result is a vocabulary in which common words, prefixes, and suffixes are single tokens, while rare words are split into multiple subword tokens. GPT-4's cl100k_base vocabulary has roughly 100K tokens; Llama 3 uses 128K. The merge rules learned during training are applied deterministically at tokenization time.
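The deterministic application of merge rules can be sketched in a few lines of Python. This is a toy illustration: production tokenizers such as tiktoken apply merges by learned priority rank rather than a simple in-order scan, and the merge list below is invented for the example.

```python
def bpe_encode(word, merges):
    """Apply learned merge rules to a word; fixed rules give a fixed result."""
    tokens = list(word)  # start from individual characters
    for a, b in merges:  # merge rules in the order they were learned
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # replace the pair with one new token
            else:
                i += 1
    return tokens

# A made-up merge list for illustration:
merges = [("c", "h"), ("a", "t"), ("ch", "at")]
print(bpe_encode("chatbot", merges))  # ['chat', 'b', 'o', 't']
```

Because the merge list is fixed, re-running this on the same input always yields the same tokens, which is why tokenization is reproducible across API calls.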
Why It Matters
BPE tokenization shapes how LLMs 'see' text in ways that affect both their capabilities and their limitations. Common English words are single tokens (good for efficiency), but rare technical terms, non-English words, and code constructs may be split into multiple tokens (less efficient, and sometimes a source of artifacts in model understanding). BPE tokenization explains why LLMs sometimes struggle with counting characters in words (in some tokenizers 'strawberry' becomes ['str', 'awberry'], so the model never sees the three 'r's as separate units), with consistent handling of rare proper nouns, and with uniform performance across languages (languages underrepresented in the tokenizer's training data get split into many more tokens).
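A toy illustration of the character-counting problem (the subword split and token IDs below are invented for the example, not taken from a real vocabulary):

```python
# Hypothetical subword split and vocabulary IDs for 'strawberry'.
pieces = ["str", "awberry"]
token_ids = [1042, 7781]  # invented IDs for illustration

# The model receives only the ID sequence. The three 'r' characters
# exist in the text, but they are never separate units in the input,
# so 'how many r's?' cannot be answered by inspecting the tokens.
assert "".join(pieces) == "strawberry"
assert "strawberry".count("r") == 3   # true at the character level
assert len(token_ids) == 2            # but the model sees 2 opaque units
```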
How It Works
BPE training process: (1) start with a character-level vocabulary (or byte-level for byte-level BPE); (2) encode the training corpus using the current vocabulary; (3) count the frequency of all adjacent token pairs; (4) merge the most frequent pair into a new token; (5) update the corpus encoding; (6) repeat steps 3-5 until the target vocabulary size is reached. The resulting merge rules are fixed: the same rules applied to any text always produce the same tokenization. The tiktoken library (OpenAI) and SentencePiece (Google) provide production BPE implementations. Different models use different BPE vocabularies, so the same text tokenizes differently from model to model.
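Steps 1-6 above can be sketched as a minimal trainer. This is a pure-Python toy; real implementations add byte-level fallback, pre-tokenization on whitespace, and efficiency optimizations that this omits.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (steps 1-6)."""
    words = [list(w) for w in corpus]        # (1) character-level encoding
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                    # (3) count adjacent token pairs
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # (4) pick the most frequent pair
        merges.append((a, b))
        for w in words:                      # (5) re-encode with the new merge
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges                            # (6) repeat until budget is used

print(train_bpe(["chat", "chat", "chatbot"], 3))
# [('c', 'h'), ('ch', 'a'), ('cha', 't')]
```

On this tiny corpus the frequent word 'chat' is merged into a single token after three rounds, while the rarer suffix 'bot' stays split, mirroring how common words end up as whole tokens in real vocabularies.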
Figure: BPE merge steps for tokenizing the word "chatbot".
Real-World Example
A 99helpers developer notices that Korean queries to their multilingual chatbot cost roughly 3x more than equivalent English queries. Using tiktoken to analyze token counts, they find that 'Hello, how can I help you?' is 8 tokens in English, while its Korean equivalent (안녕하세요, 어떻게 도와드릴까요?) is 23 tokens. The BPE vocabulary was trained primarily on English text, so Korean characters form few multi-character tokens and require more tokens per word. Understanding this, they optimize: switching to a Korean-specific tokenizer for Korean user segments reduces token usage by 40% for those users.
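Since LLM APIs bill per token, the cost gap follows directly from the token counts above (the rate below is a placeholder, not a real price):

```python
def query_cost(n_tokens, usd_per_1k_tokens):
    # Per-token billing: cost scales linearly with token count.
    return n_tokens * usd_per_1k_tokens / 1000

rate = 0.01  # placeholder rate in USD per 1K tokens
english = query_cost(8, rate)   # 'Hello, how can I help you?'
korean = query_cost(23, rate)   # the Korean equivalent
print(korean / english)         # 2.875 -- roughly the observed 3x gap
```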
Common Mistakes
- ✕ Assuming word boundaries align with BPE token boundaries—BPE tokens can span parts of words, whole words, or even multiple words; inspect actual tokenization for your specific content.
- ✕ Using character count as a proxy for token count across different scripts—BPE tokenization efficiency varies dramatically: English ~4 chars/token, Chinese/Japanese ~1-2 chars/token.
- ✕ Expecting consistent word spelling detection from LLMs—because rare words are split into subwords, LLMs may 'see' the components rather than the whole word, causing edge cases in spelling-sensitive tasks.
Related Terms
Tokenization
Tokenization converts raw text into a sequence of tokens—the basic units an LLM processes—using algorithms like byte-pair encoding that split text into subword pieces rather than whole words or individual characters.
Token
A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
Pre-Training
Pre-training is the foundational phase of LLM development where the model learns language understanding and world knowledge by predicting the next token across vast text corpora, before any task-specific optimization.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →