Byte Pair Encoding: how LLM tokenizers learn to split text into tokens
Byte Pair Encoding (BPE) is the algorithm behind most modern LLM tokenizers, including OpenAI's tiktoken and Meta's Llama tokenizer. Training starts with individual characters as tokens (or raw bytes, in the byte-level variant that tiktoken uses), then repeatedly merges the most frequent pair of adjacent tokens into a new token, building a vocabulary from the bottom up. After training on a large text corpus, common words become single tokens, while rare words stay split into subword pieces.
How this is calculated
BPE training works like this: start with every unique character in the training data as a token. Count all adjacent token pairs. Merge the most frequent pair into a new token and record that merge rule. Repeat, typically many thousands of times, until the vocabulary reaches the desired size (e.g. 100K or 200K tokens). The resulting vocabulary can handle words never seen during training, because any word can be broken into known subword tokens and, in the worst case, into its individual characters or bytes. This is why misspellings, code, and invented words still produce reasonable tokenization.
BPE's main competitor is Unigram (used by SentencePiece in some configurations), which works in the opposite direction: it starts with a large vocabulary and prunes it. BPE is more common in the latest generation of models (GPT-5, Llama 4), which is often credited to byte-level BPE representing whitespace and code without special handling.
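The training loop described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not how tiktoken or any production tokenizer is implemented: it works on characters rather than bytes, treats the corpus as a list of pre-split words, and the names `train_bpe`, `merge_pair`, and `encode` are hypothetical helpers introduced here.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts as a sequence of single-character tokens.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by how often each word occurs.
        pairs = Counter()
        for tokens, freq in words.items():
            for pair in zip(tokens, tokens[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Merge the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for tokens, freq in words.items():
            merged[merge_pair(tokens, best)] += freq
        words = merged
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in training order."""
    tokens = tuple(word)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return list(tokens)


corpus = ["low", "low", "low", "lot"]
merges = train_bpe(corpus, num_merges=2)
print(merges)                   # [('l', 'o'), ('lo', 'w')]
print(encode("low", merges))    # ['low']
print(encode("lowly", merges))  # ['low', 'l', 'y']
```

The frequent word "low" ends up as a single token, while the unseen word "lowly" is still tokenized by falling back to smaller known pieces. Real tokenizers typically add a pre-tokenization step and far more merges, which this sketch leaves out.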
Verdict
BPE is the dominant tokenization algorithm for a reason: in its byte-level form it handles any text, any language, and any script without special cases. Understanding BPE helps you predict how a model will split your text and why some prompts cost more tokens than you'd expect based on character count.
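One way to see why token counts can diverge from character counts is to look at the starting point of byte-level BPE, where the base alphabet is the 256 possible byte values of UTF-8 text. The sketch below assumes that byte-level setup; `byte_tokens` is a hypothetical helper, and the counts it prints are the pre-merge base tokens, not what any particular model would report after its learned merges are applied.

```python
def byte_tokens(text):
    """Split text into UTF-8 byte values, the base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))


# Escapes are used so the samples do not depend on the file's encoding:
# "h\u00e9llo" is "héllo" and "\u4f60\u597d" is a two-character Chinese greeting.
for sample in ["hello", "h\u00e9llo", "\u4f60\u597d"]:
    ids = byte_tokens(sample)
    print(f"{sample!r}: {len(sample)} characters -> {len(ids)} base tokens")
```

Every string maps to some sequence of values in the range 0 to 255, so no input is ever out of vocabulary. Text in scripts whose characters take several bytes each starts from a longer sequence, and it stays longer unless the training corpus contained enough of that script for merges to compress it.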