Why non-English text costs more tokens: multilingual tokenization explained

A 100-word paragraph in English might use about 130 tokens. The same paragraph translated into Chinese might use 300 tokens, into Japanese 250, into Hindi 350, and into Arabic 280. This happens because most LLM tokenizers are trained predominantly on English text, so common English words become single tokens while words in other languages are split into several subword pieces.
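
To make the gap concrete, the short sketch below turns the illustrative counts from the paragraph above into multipliers relative to English. The numbers are the article's rough examples, not measurements, and the function name is made up for this illustration.

```python
# Illustrative token counts for the same 100-word paragraph,
# copied from the text above. They are examples, not benchmarks.
TOKENS_PER_PARAGRAPH = {
    "English": 130,
    "Chinese": 300,
    "Japanese": 250,
    "Hindi": 350,
    "Arabic": 280,
}


def multiplier_vs_english(counts):
    """Divide each language's token count by the English count."""
    base = counts["English"]
    return {lang: round(n / base, 2) for lang, n in counts.items()}


multipliers = multiplier_vs_english(TOKENS_PER_PARAGRAPH)
for lang, m in multipliers.items():
    print(f"{lang}: {m}x the English token count")
```

With these example figures every non-English language lands at roughly 2x to 3x the English count, which is the range the rest of this page refers to.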


How this is calculated

The tokenization gap has real cost implications, because API usage is billed per token. A multilingual customer support chatbot that serves English, Spanish, Japanese, and Arabic users can pay 2-3x more per interaction in the languages with the widest gap, such as Japanese and Arabic. Latin-script languages such as Spanish usually sit much closer to English. Gemini and Claude tokenizers tend to handle non-English text more efficiently than GPT, likely because they were trained on more balanced multilingual corpora. Llama's tokenizer can be notably inefficient for Asian scripts because its training data skews toward English and Latin-script languages. If your application is primarily non-English, benchmark the actual token counts across models on a sample of your own text. The model with the lowest per-token price might cost more overall if its tokenizer produces 2x more tokens for your target language.
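
The comparison described above comes down to multiplying each model's token count for your text by its per-token price. Here is a minimal sketch of that arithmetic. The model names, prices, and token counts are hypothetical placeholders; substitute counts you measured on your own sample and current published prices.

```python
from dataclasses import dataclass


@dataclass
class ModelOption:
    name: str                        # hypothetical label, not a real model
    price_per_million_tokens: float  # hypothetical input price in USD
    tokens_for_sample: int           # tokens measured for the same sample text


def cost_for_sample(option: ModelOption) -> float:
    """Cost in USD of sending the sample text once to this model."""
    return option.tokens_for_sample * option.price_per_million_tokens / 1_000_000


options = [
    # Lower per-token price, but the tokenizer emits 2x the tokens.
    ModelOption("model_a", price_per_million_tokens=2.00, tokens_for_sample=700),
    # Higher per-token price, more compact tokenization.
    ModelOption("model_b", price_per_million_tokens=3.00, tokens_for_sample=350),
]

cheapest = min(options, key=cost_for_sample)
for option in options:
    print(f"{option.name}: ${cost_for_sample(option):.6f} per request")
print(f"Cheapest for this sample: {cheapest.name}")
```

In this made-up case the model with the higher per-token price is cheaper overall, because it needs half as many tokens for the same text.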

Verdict

Tokenization efficiency matters most for non-English applications, where the same content can take two to three times as many tokens as it would in English. Benchmark actual token counts in your target language before choosing a model for cost reasons. Gemini and Claude often deliver more text per dollar for multilingual workloads despite higher per-token pricing.


Frequently asked questions

What is a token in an LLM?
A token is a chunk of text that a language model reads as a single unit. It is usually a common word, part of a longer word, or a piece of punctuation rather than a whole word or a single character. As a rough rule of thumb, one token is about four characters of English text, and 100 tokens is roughly 75 words.
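
The rule of thumb above can be written as a quick estimator. This is a rough sketch for English text only, with made-up function names. It does not run a real tokenizer and will undercount for the non-English scripts discussed on this page.

```python
def estimate_tokens_from_chars(text: str) -> int:
    """Rough English-only estimate: about 4 characters per token."""
    if not text:
        return 0
    return max(1, round(len(text) / 4))


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough English-only estimate: 100 tokens is about 75 words."""
    return round(word_count * 100 / 75)


print(estimate_tokens_from_words(75))          # 75 words of English
print(estimate_tokens_from_chars("a" * 400))   # 400 characters of English
```

Use it for budgeting only; for billing-grade numbers, count with the tokenizer of the model you plan to call.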
How accurate is this token counter?
For OpenAI and Llama models the count is exact, because it uses the same Byte Pair Encoding tokenizers those models ship (o200k_base for GPT-5 and GPT-4o, cl100k_base for GPT-4 and GPT-3.5, and the Llama 3 tokenizer for Llama 3 and 4). For Claude and Gemini the count is a labelled estimate, since those tokenizers are not publicly available to run in the browser. Estimates are typically within about 10% of the real value.
Why do different models report different token counts?
Each model family is trained with its own tokenizer and vocabulary, so the same sentence can split into a different number of tokens depending on the model. Newer vocabularies like OpenAI's o200k_base are generally more efficient, packing more characters into each token, which lowers the count compared to older tokenizers.
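
A toy example shows why vocabulary matters. The sketch below is not Byte Pair Encoding and does not reproduce any real model's tokenizer. It is a simplified greedy longest-match splitter with two invented vocabularies, used only to show that the same string yields different token counts under different vocabularies.

```python
def greedy_tokenize(text, vocab):
    """Repeatedly take the longest vocab entry that matches at the
    current position; fall back to a single character if none does."""
    tokens = []
    max_len = max(len(entry) for entry in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Two invented vocabularies: the larger one contains longer entries.
small_vocab = {"token", "iz", "ation"}
large_vocab = {"token", "iz", "ation", "tokenization", " costs"}

text = "tokenization costs"
small_tokens = greedy_tokenize(text, small_vocab)
large_tokens = greedy_tokenize(text, large_vocab)
print(len(small_tokens), small_tokens)
print(len(large_tokens), large_tokens)
```

The larger vocabulary covers the string in far fewer pieces. The same effect, at scale, is why text in a script that is poorly represented in a tokenizer's vocabulary breaks into many small tokens.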
Is my text sent to a server?
No. All tokenization and counting happens locally in your browser using a tokenizer that loads on the page. Nothing you type or paste is uploaded, logged, or stored, which makes the tool safe to use for private prompts and confidential text.