What are tokens in LLMs? How language models break text into pieces

A token is the atomic unit of text that a language model processes. It's not quite a word and not quite a character. Common words like 'the' are usually a single token, while longer or less common words like 'tokenization' may be split into two or three. As a rule of thumb, one token is roughly 4 characters of English text, so 100 tokens is roughly 75 words.
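The rule of thumb can be turned into a quick estimator. The following is a minimal sketch, not a real tokenizer: the function names are made up for illustration, and the two constants simply encode the 4-characters and 75-words-per-100-tokens figures, which only hold approximately and only for English text.

```python
import math

CHARS_PER_TOKEN = 4         # rule of thumb for English text
WORDS_PER_100_TOKENS = 75   # rule of thumb for English text


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate the token count as characters / 4, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Estimate the token count as words * 100 / 75, rounded up."""
    word_count = len(text.split())
    return math.ceil(word_count * 100 / WORDS_PER_100_TOKENS)


sample = "Tokens are how language models see text."
print(estimate_tokens_from_chars(sample))
print(estimate_tokens_from_words(sample))
```

The two estimates will not always agree, and neither replaces running the model's actual tokenizer.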


How this is calculated

Tokens exist because language models don't operate on text directly. They operate on sequences of numbers. A tokenizer splits text into tokens, and each token maps to a numeric ID in the model's vocabulary. The model processes these IDs, not the raw text. Different model families use different tokenizers with different vocabularies, which is why the same sentence can produce a different token count for GPT-5 vs Claude vs Llama 4. OpenAI's o200k_base tokenizer (used by GPT-5 and GPT-4o) has a vocabulary of roughly 200,000 tokens, about twice the size of the older cl100k_base vocabulary, and it typically produces fewer tokens for the same input.
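To make the text-to-IDs step concrete, here is a toy sketch. It uses greedy longest-match against a hand-written vocabulary, which is a simplification and not the Byte Pair Encoding procedure that real tokenizers use. Both vocabularies and all IDs are invented for illustration and do not come from any real model.

```python
from typing import Dict, List


def tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Map text to token IDs by taking the longest matching vocab piece at each position."""
    ids: List[int] = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids


# Hypothetical vocabularies: the larger one has a dedicated entry for a longer piece.
SMALL_VOCAB = {"the": 0, " ": 1, "token": 2, "iz": 3, "ation": 4}
LARGE_VOCAB = {**SMALL_VOCAB, "ization": 5, " tokenization": 6}

text = "the tokenization"
print(tokenize(text, SMALL_VOCAB))  # five IDs
print(tokenize(text, LARGE_VOCAB))  # two IDs
```

The same input yields five tokens with the small vocabulary and two with the large one, which mirrors why a larger vocabulary tends to produce lower token counts.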

Verdict

Tokens are how LLMs see text. Understanding them helps you estimate costs, stay inside context windows, and write better prompts. The 4-chars-per-token rule of thumb is good enough for back-of-the-envelope estimates. For exact counts, use a tokenizer.
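Once you have a token count, whether exact or estimated, cost and context checks are simple arithmetic. The sketch below assumes the prompt and the reply share one context window and that pricing is quoted per million tokens. The window size and prices are placeholders, not figures for any real model.

```python
def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Return True if the prompt plus the reserved reply length fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def estimate_cost(
    prompt_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Return the estimated cost in the same currency as the per-million prices."""
    return (
        prompt_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Placeholder values for illustration only.
CONTEXT_WINDOW = 8_000
INPUT_PRICE = 2.00   # hypothetical price per million input tokens
OUTPUT_PRICE = 8.00  # hypothetical price per million output tokens

print(fits_context(prompt_tokens=6_500, max_output_tokens=1_000, context_window=CONTEXT_WINDOW))
print(estimate_cost(6_500, 1_000, INPUT_PRICE, OUTPUT_PRICE))
```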

Frequently asked questions

What is a token in an LLM?
A token is a chunk of text that a language model reads as a single unit. It may be a whole common word, part of a longer word, or a piece of punctuation, so it does not map neatly onto either words or characters. As a rough rule of thumb, one token is about four characters of English text, and 100 tokens is roughly 75 words.
How accurate is this token counter?
For OpenAI and Llama models the count is exact, because it uses the same Byte Pair Encoding tokenizers those models ship (o200k_base for GPT-5 and GPT-4o, cl100k_base for GPT-4 and GPT-3.5, and the Llama 3 tokenizer for Llama 3 and 4). For Claude and Gemini the count is a labelled estimate, since those tokenizers are not publicly available to run in the browser. Estimates are typically within about 10% of the real value.
Why do different models report different token counts?
Each model family is trained with its own tokenizer and vocabulary, so the same sentence can split into a different number of tokens depending on the model. Newer vocabularies like OpenAI's o200k_base are generally more efficient, packing more characters into each token, which lowers the count compared to older tokenizers.
Is my text sent to a server?
No. All tokenization and counting happens locally in your browser using a tokenizer that loads on the page. Nothing you type or paste is uploaded, logged, or stored, which makes the tool safe to use for private prompts and confidential text.