Why non-English text costs more tokens: multilingual tokenization explained
A 100-word English paragraph might use about 130 tokens. The same paragraph translated into Chinese might use about 300 tokens, into Japanese 250, Hindi 350, and Arabic 280; exact counts vary by tokenizer. The gap exists because most LLM tokenizers build their vocabularies from predominantly English text, so common English words become single tokens while words in other languages are split into several subword pieces.
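The splitting behavior described above can be illustrated with a toy tokenizer. This is a minimal sketch, not how any production tokenizer is implemented: the vocabulary `VOCAB`, the function `tokenize`, and the sample sentences are all hypothetical, and real tokenizers typically fall back to subword or byte pieces rather than whole characters. It only shows that text covered by the vocabulary compresses into few tokens while uncovered text does not.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization with a single-character fallback."""
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            # A single character is always accepted as the fallback token.
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical vocabulary that only contains English word pieces.
VOCAB = {"the", " the", " cat", " sat", " on", " mat"}

english = "the cat sat on the mat"
japanese = "猫はマットの上に座った"

english_tokens = tokenize(english, VOCAB)
japanese_tokens = tokenize(japanese, VOCAB)

print(len(english), "characters ->", len(english_tokens), "tokens")
print(len(japanese), "characters ->", len(japanese_tokens), "tokens")
```

In this sketch the English sentence collapses to one token per word, while every Japanese character becomes its own token because none of it appears in the vocabulary.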
How this is calculated
The tokenization gap has real cost implications because API usage is billed per token. A multilingual customer support chatbot that serves English, Spanish, Japanese, and Arabic users can pay roughly 2-3x more per interaction in the languages with the largest gaps, such as Japanese and Arabic; Latin-script languages like Spanish typically sit much closer to English. Gemini and Claude tokenizers tend to handle non-English text more efficiently than GPT's, which is usually attributed to more balanced multilingual training corpora. Llama's tokenizer is notably inefficient for Asian scripts because its training data skews toward English and Latin-script languages. These are tendencies rather than guarantees, and they shift between tokenizer versions, so if your application is primarily non-English, benchmark the actual token counts across models. The model with the lowest per-token price can cost more overall if its tokenizer produces 2x more tokens for your target language.
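The comparison comes down to token count multiplied by per-token price. The sketch below works through that arithmetic under stated assumptions: `model_a` and `model_b` are placeholder names, and their token counts and prices are made-up figures chosen to show the effect, not measurements or real rates.

```python
def request_cost(token_count: int, price_per_million_tokens: float) -> float:
    """Dollar cost of processing token_count tokens at the given rate."""
    return token_count * price_per_million_tokens / 1_000_000


# Hypothetical figures for the same paragraph in one target language.
models = {
    "model_a": {"tokens": 300, "price_per_million": 2.00},  # cheaper per token
    "model_b": {"tokens": 150, "price_per_million": 3.00},  # pricier per token
}

costs = {
    name: request_cost(m["tokens"], m["price_per_million"])
    for name, m in models.items()
}
cheapest = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: ${cost:.6f} per paragraph")
print("cheapest overall:", cheapest)
```

With these placeholder numbers, `model_b` is cheaper per paragraph even though its per-token price is 50% higher, because it needs half as many tokens. Substituting token counts measured with each provider's own tokenizer turns this into an actual benchmark.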
Verdict
Tokenization efficiency matters most for non-English applications. Benchmark actual token counts in your target language before choosing a model on cost. Gemini and Claude often deliver more text per dollar for multilingual workloads despite higher per-token pricing, because they need fewer tokens to encode the same content.