Byte Pair Encoding: how LLM tokenizers learn to split text into tokens

Byte Pair Encoding (BPE) is the algorithm behind most modern LLM tokenizers, including OpenAI's tiktoken and the tokenizers used by Meta's Llama models. BPE starts with individual characters (or, in the byte-level variants most LLMs use, individual bytes) as tokens, then repeatedly merges the most frequent pair of adjacent tokens into a new token, building a vocabulary from the bottom up. After training on a large text corpus, common words become single tokens, while rare words stay split into subword pieces.

How this is calculated

BPE training works like this: start with every unique character in the training data as a token (byte-level implementations start from the 256 possible byte values instead). Count all adjacent token pairs. Merge the most frequent pair into a new token. Repeat, typically tens or hundreds of thousands of times, until the vocabulary reaches the desired size (e.g. 100K or 200K tokens).

The resulting vocabulary handles any text, even words never seen during training, because any word can be broken into known subword tokens, falling back to single characters or bytes if necessary. This is why misspellings, code, and invented words still produce reasonable tokenization.

BPE's main competitor is Unigram (used by SentencePiece in some configurations), which starts with a large vocabulary and prunes it. BPE is the more common choice in the latest generation of models (GPT-5, Llama 4), which is often attributed to how well byte-level BPE copes with whitespace and code.
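The training loop described above can be sketched in a few lines of Python. This is a minimal character-level toy, not the implementation used by tiktoken or any production tokenizer. The function names (train_bpe, merge_pair, encode), the tiny corpus, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each left-to-right occurrence of `pair` in `tokens` with the joined token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy, character-level)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair, weighted by how often the word occurs.
        pairs = Counter()
        for tokens, freq in words.items():
            for pair in zip(tokens, tokens[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Most frequent pair wins; ties go to the lexicographically smallest pair
        # (an arbitrary choice, made here only so the result is deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Apply the new merge everywhere before counting again.
        new_words = Counter()
        for tokens, freq in words.items():
            new_words[merge_pair(tokens, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Split a word into characters, then replay the learned merges in training order."""
    tokens = tuple(word)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return list(tokens)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)                    # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the toy corpus, yet it still splits into two learned pieces. A real tokenizer would also map each piece to an integer ID and would usually work on bytes, not characters, both of which this sketch leaves out.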

Verdict

BPE is the dominant tokenization algorithm for a reason: it handles any text, any language, and any script without special cases. Understanding BPE helps you predict how a model will split your text and why some prompts cost more tokens than you'd expect based on character count.

Frequently asked questions

What is a token in an LLM?
A token is a chunk of text that a language model reads as a single unit. It can be a common whole word, part of a longer word, or a piece of punctuation, so tokens do not map one-to-one onto words or characters. As a rough rule of thumb, one token is about four characters of English text, and 100 tokens is roughly 75 words.
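As a rough illustration of that rule of thumb, the sketch below turns the two ratios into quick estimates. The function names are hypothetical, and the results are only a ballpark for English text. Running an actual tokenizer is the only way to get the real count.

```python
import math


def estimate_tokens_from_chars(text: str) -> int:
    """Ballpark estimate: about 4 characters of English text per token."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(text: str) -> int:
    """Ballpark estimate: 100 tokens per 75 words, i.e. about 4/3 tokens per word."""
    return math.ceil(len(text.split()) * 100 / 75)


sentence = "The quick brown fox jumps over the lazy dog."
print(estimate_tokens_from_chars(sentence))  # 11 (44 characters / 4)
print(estimate_tokens_from_words(sentence))  # 12 (9 words * 100 / 75)
```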
How accurate is this token counter?
For OpenAI and Llama models the count is exact, because the counter runs the same Byte Pair Encoding tokenizers those models use (o200k_base for GPT-5 and GPT-4o, cl100k_base for GPT-4 and GPT-3.5, and the Llama 3 tokenizer for Llama 3 and 4). For Claude and Gemini the count is a labelled estimate, since those tokenizers are not publicly available to run in the browser. Estimates are typically within about 10% of the real value.
Why do different models report different token counts?
Each model family is trained with its own tokenizer and vocabulary, so the same sentence can split into a different number of tokens depending on the model. Newer vocabularies like OpenAI's o200k_base are generally more efficient, packing more characters into each token, which lowers the count compared to older tokenizers.
Is my text sent to a server?
No. All tokenization and counting happens locally in your browser using a tokenizer that loads on the page. Nothing you type or paste is uploaded, logged, or stored, which makes the tool safe to use for private prompts and confidential text.