Large Language Models (LLMs)

Tokenization

Definition

Tokenization is the process of splitting input text into discrete units called tokens that an LLM can process. Modern LLMs use subword tokenization: common words are single tokens ('the', 'is', 'hello'), common prefixes and suffixes are tokens ('un-', '-ing', '-tion'), and rare words are split into multiple subword tokens ('transformer' might be ['transform', 'er']). A typical LLM vocabulary contains 50,000-100,000+ tokens. Text is tokenized before being converted to embeddings for processing, and the model outputs token IDs that are decoded back to text. Tokenization is model-specific—the same text tokenized with GPT-4's tokenizer and Llama-3's tokenizer will produce different token sequences.
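The greedy longest-match behavior described above can be illustrated with a toy tokenizer. This is a sketch over a made-up miniature vocabulary, not a real model's vocabulary or algorithm:

```python
# Toy illustration of subword tokenization: greedily match the longest
# vocabulary entry at each position. The vocabulary here is invented.
VOCAB = {"transform", "token", "izer", "ize", "er", "t", "r", "a", "n",
         "s", "f", "o", "m", "k", "e", "i", "z"}

def tokenize(text: str, vocab: set[str]) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it alone (real BPE falls back to bytes).
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("transformer", VOCAB))  # → ['transform', 'er']
print(tokenize("tokenizer", VOCAB))    # → ['token', 'izer']
```

A rare word like "transformer" has no single vocabulary entry, so it splits into the subwords 'transform' and 'er', exactly as in the definition above.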

Why It Matters

Tokenization directly affects API costs (billed per token), context window usage (longer texts consume more tokens), and LLM performance on different content types. Non-English text, code, and technical jargon tokenize less efficiently—a word in Turkish or Korean may require 3-5 tokens where an equivalent English word requires 1-2. This means the same concept expressed in different languages can have very different costs and context window consumption. For 99helpers customers building multilingual chatbots or processing code-heavy documentation, understanding tokenization efficiency helps optimize context window usage and control API costs.

How It Works

Most LLM providers use Byte-Pair Encoding (BPE) or derivatives. BPE starts with a vocabulary of individual characters, then iteratively merges the most frequent adjacent pairs into new tokens. After many merge operations, common subwords, words, and phrases become single tokens. At tokenization time, the algorithm greedily applies the longest matching token from the vocabulary. Tools like OpenAI's tiktoken library let developers count tokens before making API calls. A rough rule of thumb: 1 token ≈ 4 characters or 0.75 English words. 1,000 tokens ≈ 750 words or about 1.5 pages of text.
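The BPE training loop described above can be sketched in a few lines. This is a minimal educational version on a toy corpus, not a production trainer (real implementations work on bytes and much larger corpora):

```python
from collections import Counter

def bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a toy corpus (a sketch, not a full trainer)."""
    corpus = [list(w) for w in words]      # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                  # count adjacent symbol pairs
        for symbols in corpus:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = []
        for symbols in corpus:             # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges

# Frequent substrings become tokens first:
print(bpe_merges(["low", "low", "lower", "lowest"], 2))
```

After two merges on this toy corpus, 'l'+'o' and then 'lo'+'w' are merged, so the common word "low" becomes a single token while rarer suffixes remain split.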

Tokenization Pipeline — Text → Token IDs → Embeddings

1. Raw text: "LLMs tokenize text efficiently."
2. BPE tokens (after the BPE tokenizer runs): ['LL', 'Ms', 'token', 'ize', 'text', 'efficiently', '.']
3. Token IDs (vocabulary lookup): [4720, 5354, 3263, 1096, 1366, 14467, 13]
4. Embedding vectors (embedding matrix lookup): the first token "LL" maps to a vector such as [0.82, -0.31, 0.14, 0.67, -0.55, 0.09, 0.43, -0.22] (d=8 shown; real models use d=4096+).

BPE (Byte-Pair Encoding) merges the most frequent character pairs iteratively to build a vocabulary of ~50K–100K subword tokens that covers all text efficiently.
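Step 4 of the pipeline is a plain row lookup: each token ID indexes one row of the embedding matrix. A minimal sketch; the row for "LL" uses the d=8 values from the pipeline example, while the second row is a made-up illustration:

```python
# Toy embedding table keyed by token ID. Real models store this as a
# (vocab_size x d) matrix with d >= 4096; values here are illustrative.
EMBEDDINGS = {
    4720: [0.82, -0.31, 0.14, 0.67, -0.55, 0.09, 0.43, -0.22],  # "LL"
    5354: [0.11, 0.47, -0.63, 0.05, 0.38, -0.19, 0.72, 0.01],   # "Ms" (made up)
}

def embed(token_ids: list[int]) -> list[list[float]]:
    """Map each token ID to its embedding vector (one row per token)."""
    return [EMBEDDINGS[tid] for tid in token_ids]

vectors = embed([4720, 5354])
print(len(vectors), len(vectors[0]))  # 2 tokens, 8 dimensions each
```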

Real-World Example

A 99helpers developer builds a feature that processes customer-uploaded PDFs. Before implementing, they use tiktoken to estimate token counts: a 10-page PDF typically contains ~5,000-7,000 tokens. With GPT-4o's 128K context limit, roughly 180-250 pages fit in a single request, so even large PDFs fit comfortably. For pricing calculations: at $0.0025 per 1K input tokens, a 5,000-token document costs $0.0125 to process—reasonable for single documents, but 100 documents/day costs $1.25/day or $37.50/month just for document ingestion, informing their cost model.
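The arithmetic in this example can be wrapped in a small estimator. The document size and per-1K-token price are the assumed figures from the example above, not quoted rates:

```python
def ingestion_cost(tokens_per_doc: int, docs_per_day: int,
                   price_per_1k: float, days: int = 30) -> float:
    """Estimate input-token spend for document ingestion over a billing period."""
    per_doc = tokens_per_doc / 1000 * price_per_1k   # cost of one document
    return per_doc * docs_per_day * days

# 5,000-token documents at $0.0025 per 1K input tokens, 100 docs/day:
print(round(ingestion_cost(5000, 100, 0.0025), 2))  # → 37.5 dollars/month
```

Swapping in a different model's price or a larger average document size immediately shows how sensitive the monthly cost is to tokenization efficiency.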

Common Mistakes

  • Counting words instead of tokens for context window and cost estimation—tokens and words have a 4:3 ratio on average for English but diverge significantly for code, non-Latin scripts, and technical content.
  • Assuming the same tokenizer across all models—switching from GPT-4 to Claude requires re-evaluating token counts because tokenizers differ.
  • Ignoring subword splits at word boundaries—words broken into multiple tokens mean the model 'sees' pieces of words rather than whole words, which occasionally produces odd behavior with rare technical terms.
