Tokenization
Definition
Tokenization is the process of splitting input text into discrete units called tokens that an LLM can process. Modern LLMs use subword tokenization: common words are single tokens ('the', 'is', 'hello'), common prefixes and suffixes are tokens ('un-', '-ing', '-tion'), and rare words are split into multiple subword tokens ('transformer' might be ['transform', 'er']). A typical LLM vocabulary contains 50,000-100,000+ tokens. Text is tokenized before being converted to embeddings for processing, and the model outputs token IDs that are decoded back to text. Tokenization is model-specific—the same text tokenized with GPT-4's tokenizer and Llama-3's tokenizer will produce different token sequences.
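The greedy subword splitting described above can be sketched with a toy vocabulary. This is a minimal illustration, not any real model's tokenizer: the vocabulary entries are invented, and real vocabularies contain tens of thousands of learned subwords.

```python
# Toy greedy longest-match tokenizer over a made-up subword vocabulary,
# illustrating how a common word stays whole while a rarer word splits.
VOCAB = {"the", "is", "hello", "transform", "form", "er", "un", "ing"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return tokens

print(tokenize("hello"))        # ['hello'] — common word, one token
print(tokenize("transformer"))  # ['transform', 'er'] — split into subwords
```

With a different vocabulary the same word would split differently, which is exactly why GPT-4 and Llama-3 produce different token sequences for identical text.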
Why It Matters
Tokenization directly affects API costs (billed per token), context window usage (longer texts consume more tokens), and LLM performance on different content types. Non-English text, code, and technical jargon tokenize less efficiently—a word in Turkish or Korean may require 3-5 tokens where an equivalent English word requires 1-2. This means the same concept expressed in different languages can have very different costs and context window consumption. For 99helpers customers building multilingual chatbots or processing code-heavy documentation, understanding tokenization efficiency helps optimize context window usage and control API costs.
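One driver of the multilingual efficiency gap is visible at the byte level. Byte-level BPE tokenizers fall back to raw UTF-8 bytes for character sequences they have no merges for, and non-Latin scripts cost more bytes per character: each Hangul syllable is 3 bytes in UTF-8 versus 1 byte per ASCII letter. Actual token counts depend on the tokenizer's learned merges, so this is a lower-bound intuition, not a precise count:

```python
# UTF-8 byte counts hint at why byte-level BPE spends more tokens on
# non-Latin scripts: 3 bytes per Hangul syllable vs. 1 per ASCII letter.
english = "hello"
korean = "안녕하세요"  # "hello" in Korean, 5 syllables

print(len(english.encode("utf-8")))  # 5 bytes
print(len(korean.encode("utf-8")))   # 15 bytes to merge into tokens
```

A tokenizer trained mostly on English text has few Korean merges, so those 15 bytes often end up as several tokens while "hello" is a single one.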
How It Works
Most LLM providers use Byte-Pair Encoding (BPE) or derivatives. BPE starts with a vocabulary of individual characters, then iteratively merges the most frequent adjacent pairs into new tokens. After many merge operations, common subwords, words, and phrases become single tokens. At tokenization time, the algorithm greedily applies the longest matching token from the vocabulary. Tools like OpenAI's tiktoken library let developers count tokens before making API calls. A rough rule of thumb: 1 token ≈ 4 characters or 0.75 English words. 1,000 tokens ≈ 750 words or about 1.5 pages of text.
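The iterative merging at the heart of BPE can be sketched in a few lines. This is a simplified training loop over a tiny invented word-frequency corpus, not a production implementation (real trainers work at byte level and handle pre-tokenization):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency. Start from individual characters.
words = {tuple(w): f for w, f in {"low": 5, "lower": 2, "lowest": 3}.items()}
merges = []
for _ in range(4):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # frequent pairs merge first, building up subwords like 'low'
```

After a couple of merges the shared stem "low" becomes a single token, which is how common subwords end up as one vocabulary entry.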
Tokenization Pipeline — Text → Token IDs → Embeddings
1. Raw text
2. BPE tokens
3. Token IDs
4. Embedding vectors (d=4,096+ in production models)
BPE (Byte-Pair Encoding) merges the most frequent character pairs iteratively to build a vocabulary of ~50K–100K subword tokens that covers all text efficiently.
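The pipeline's last three steps can be sketched end to end. The vocabulary, token list, and 8-dimensional embedding table below are all invented for illustration; real models use 50K–100K+ vocabularies and embedding dimensions of 4,096 or more:

```python
import random

# Sketch of: subword tokens -> token IDs -> embedding vectors.
vocab = ["LL", "M", "s", " are", " fun"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

random.seed(0)
d = 8  # toy dimension; production models use 4,096+
embedding_table = [[random.uniform(-1, 1) for _ in range(d)] for _ in vocab]

tokens = ["LL", "M", "s", " are", " fun"]   # pretend BPE tokenizer output
ids = [token_to_id[t] for t in tokens]       # [0, 1, 2, 3, 4]
vectors = [embedding_table[i] for i in ids]  # one d-dim vector per token
```

The model never sees the text itself: it operates on these vectors, and its output token IDs are decoded back to text by reversing the ID-to-token mapping.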
Real-World Example
A 99helpers developer builds a feature that processes customer-uploaded PDFs. Before implementing, they use tiktoken to estimate token counts: a 10-page PDF typically contains ~5,000-7,000 tokens. With GPT-4o's 128K-token context limit, they can fit roughly 18-25 such documents (180-250 pages) per request. For pricing calculations: at $0.0025 per 1K input tokens, a 5,000-token document costs $0.0125 to process—reasonable for single documents, but 100 documents/day adds up to $1.25/day, or $37.50/month, just for document ingestion, informing their cost model.
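The arithmetic behind that cost model fits in a small helper. The function and its parameter names are made up for this sketch; plug in your provider's actual per-token price:

```python
def ingestion_cost(tokens_per_doc, docs_per_day, price_per_1k=0.0025, days=30):
    """Estimate input-token costs for document ingestion at a flat per-1K rate."""
    per_doc = tokens_per_doc / 1000 * price_per_1k
    per_day = per_doc * docs_per_day
    return per_doc, per_day, per_day * days

per_doc, per_day, per_month = ingestion_cost(5000, 100)
# ≈ $0.0125 per document, $1.25/day, $37.50/month — matching the example above
```

Note this covers input tokens only; output tokens are usually billed at a higher rate and would be added on top.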
Common Mistakes
- ✕ Counting words instead of tokens for context window and cost estimation—tokens and words average a 4:3 ratio for English but diverge significantly for code, non-Latin scripts, and technical content.
- ✕ Assuming the same tokenizer across all models—switching from GPT-4 to Claude requires re-evaluating token counts because the tokenizers differ.
- ✕ Ignoring how words split across token boundaries—a rare word broken into multiple subword tokens means the LLM 'sees' fragments rather than the whole word, occasionally producing odd behavior with rare technical terms.
Related Terms
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) is the subword tokenization algorithm used by most LLMs to build their vocabulary by iteratively merging the most frequent adjacent byte or character pairs in training text.
Token
A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →