What are tokens in LLMs? How language models break text into pieces
A token is the atomic unit of text that a language model processes. It is not quite a word and not quite a character. Common words like 'the' are usually a single token. Longer or less common words like 'tokenization' may be split into two or three tokens. As a rule of thumb for English text, one token is roughly 4 characters, which works out to about 100 tokens per 75 words.
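The rule of thumb can be turned into a quick estimator. This is a minimal sketch that assumes English prose; the function names are hypothetical, and the two ratios (4 characters per token, 100 tokens per 75 words) come from the rule of thumb rather than from any real tokenizer, so the results are rough estimates only.

```python
import math


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate tokens as roughly one token per 4 characters of English text."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(text: str) -> int:
    """Estimate tokens as roughly 100 tokens per 75 whitespace-separated words."""
    word_count = len(text.split())
    return math.ceil(word_count * 100 / 75)


if __name__ == "__main__":
    sample = "Tokens are how language models see text."
    print("by characters:", estimate_tokens_from_chars(sample))
    print("by words:", estimate_tokens_from_words(sample))
```

The two estimates will not always agree, and neither will match a real tokenizer exactly; they are only meant for sizing a prompt before counting it properly.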
How this is calculated
Tokens exist because language models do not operate on text directly; they operate on sequences of numbers. A tokenizer splits text into tokens, and each token maps to a numeric ID in the model's vocabulary. The model processes these IDs, not the raw text. Different model families use different tokenizers with different vocabularies, which is why the same sentence can produce different token counts for GPT-5, Claude, and Llama 4. OpenAI's o200k_base tokenizer (used by GPT-5 and GPT-4o) has a vocabulary of roughly 200,000 tokens and typically produces fewer tokens for the same input than older tokenizers such as cl100k_base.
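To make the text-to-IDs step concrete, here is a toy sketch. It uses greedy longest-match against a tiny made-up vocabulary, which is a simplification and not how o200k_base or any production tokenizer is built (those typically use byte-pair encoding or a similar scheme). The names tokenize, build_vocab, VOCAB_A, and VOCAB_B are hypothetical. The point is only to show that each token maps to a numeric ID and that a larger vocabulary can cover the same text with fewer tokens.

```python
def build_vocab(pieces: list[str]) -> dict[str, int]:
    """Assign each text piece a numeric ID equal to its position in the list."""
    return {piece: token_id for token_id, piece in enumerate(pieces)}


def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Convert text to vocabulary IDs by greedy longest-match (toy approach)."""
    ids = []
    pos = 0
    max_len = max(len(piece) for piece in vocab)
    while pos < len(text):
        # Try the longest candidate piece first, then shrink one character at a time.
        for end in range(min(len(text), pos + max_len), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids


# Two made-up vocabularies: B is A plus one extra, longer entry.
BASE_PIECES = list("aehiknotz") + [" ", "the", "token", "ization"]
VOCAB_A = build_vocab(BASE_PIECES)
VOCAB_B = build_vocab(BASE_PIECES + ["tokenization"])

if __name__ == "__main__":
    sentence = "the tokenization"
    print(tokenize(sentence, VOCAB_A))  # 'tokenization' split into 'token' + 'ization'
    print(tokenize(sentence, VOCAB_B))  # 'tokenization' covered by a single entry
```

With the smaller vocabulary the sentence becomes four IDs; with the larger one it becomes three. The same effect, at a much larger scale, is why token counts differ between model families.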
Verdict
Tokens are how LLMs see text. Understanding them helps you estimate costs, stay inside context windows, and write better prompts. The 4-characters-per-token rule of thumb is good enough for back-of-the-envelope estimates of English text. For exact counts, run the text through the tokenizer of the model you are targeting.
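As an illustration of the cost and context-window uses, the sketch below applies the 4-characters-per-token estimate to a prompt. Everything here is an assumption for illustration: the function names are hypothetical, and the context window size, output reservation, and price per million tokens are parameters you would supply from your provider's documentation, not values taken from this article.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about one token per 4 characters of English text."""
    return math.ceil(len(text) / 4)


def fits_context(prompt: str, context_window: int, reserved_output_tokens: int) -> bool:
    """Check whether the estimated prompt plus the reserved output fits the window."""
    return estimate_tokens(prompt) + reserved_output_tokens <= context_window


def estimate_input_cost(prompt: str, usd_per_million_tokens: float) -> float:
    """Estimate input cost in USD from a caller-supplied per-million-token price."""
    return estimate_tokens(prompt) * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    prompt = "Summarize the following report in three bullet points. " * 50
    # The window size and price below are placeholder values, not real pricing.
    print(fits_context(prompt, context_window=8_000, reserved_output_tokens=500))
    print(estimate_input_cost(prompt, usd_per_million_tokens=1.0))
```

Because the estimate can be off in either direction, leave some headroom when a prompt is close to the limit, and switch to an exact tokenizer count before relying on the number.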