Token efficiency: how to say more with fewer tokens and cut your LLM bill

Token efficiency is about maximizing the intelligence you get per token you spend. Wasted tokens are wasted money. A verbose system prompt with 500 tokens of polite filler costs the same as 500 tokens of detailed instructions, and the latter gets better results. The most common token wastes: redundant instructions repeated across messages, politeness padding, examples that are longer than necessary, and including entire documents when only a section is relevant.

Knowledge area
Best Practices
Practical token optimization advice
Topic focus
Token efficiency
token-efficiency

How this is calculated

Concrete efficiency tactics: use imperative mood ('Return a JSON array of...') instead of polite requests ('Could you please return...'), which saves 5-10 tokens per prompt with no quality loss. Put static instructions in the system prompt where they can be cached. Use the shortest example that demonstrates the pattern. For structured output, use the API's native structured output or JSON mode instead of describing the format in the prompt. Trim trailing whitespace, which some tokenizers count as tokens. For multi-turn conversations, prune the history aggressively: models rarely need more than the last 5-10 exchanges for context.

Verdict

Token efficiency compounds. Saving 50 tokens per request across 10,000 daily requests saves 500,000 tokens per day, which is real money at scale. Write for the model the way you'd write for a busy colleague: clear, direct, no fluff.

More Tokens scenarios

Frequently asked questions

What is a token in an LLM?
A token is a chunk of text that a language model reads as a single unit. It is usually a common word, part of a longer word, or a piece of punctuation rather than a whole word or a single character. As a rough rule of thumb, one token is about four characters of English text, and 100 tokens is roughly 75 words.
How accurate is this token counter?
For OpenAI and Llama models the count is exact, because it uses the same Byte Pair Encoding tokenizers those models ship (o200k_base for GPT-5 and GPT-4o, cl100k_base for GPT-4 and GPT-3.5, and the Llama 3 tokenizer for Llama 3 and 4). For Claude and Gemini the count is a labelled estimate, since those tokenizers are not publicly available to run in the browser. Estimates are typically within about 10% of the real value.
Why do different models report different token counts?
Each model family is trained with its own tokenizer and vocabulary, so the same sentence can split into a different number of tokens depending on the model. Newer vocabularies like OpenAI's o200k_base are generally more efficient, packing more characters into each token, which lowers the count compared to older tokenizers.
Is my text sent to a server?
No. All tokenization and counting happens locally in your browser using a tokenizer that loads on the page. Nothing you type or paste is uploaded, logged, or stored, which makes the tool safe to use for private prompts and confidential text.