How to reduce token usage: practical prompt optimization that saves money

Prompt optimization is the practice of getting the same or better results from an LLM with fewer input tokens. Since most API pricing charges per input token, trimming your prompts directly reduces your bill. A 30% reduction in token usage means a 30% reduction in cost, with no change to the model or provider.

Knowledge area
Best Practices
Practical token optimization advice
Topic focus
Prompt optimization
prompt-optimization

How this is calculated

Effective optimization techniques: remove politeness padding ('please', 'could you', 'I would like you to'), which can add 10-15% to token counts. Use system prompts for persistent instructions rather than repeating them in every user message. Compress few-shot examples to the minimum that demonstrates the pattern. Use the prompt cache feature (available on OpenAI, Anthropic, and DeepSeek) for static content like system prompts and few-shot examples; cached tokens cost 50-90% less. For structured output, use the model's native JSON mode or structured output feature rather than lengthy formatting instructions in the prompt. For multi-turn conversations, prune conversation history aggressively.

Verdict

Prompt optimization is the fastest way to cut LLM costs. Start by removing politeness language and moving instructions to the system prompt. Then enable prompt caching. These two changes alone typically reduce costs by 30% or more.

More Tokens scenarios

Frequently asked questions

What is a token in an LLM?
A token is a chunk of text that a language model reads as a single unit. It is usually a common word, part of a longer word, or a piece of punctuation rather than a whole word or a single character. As a rough rule of thumb, one token is about four characters of English text, and 100 tokens is roughly 75 words.
How accurate is this token counter?
For OpenAI and Llama models the count is exact, because it uses the same Byte Pair Encoding tokenizers those models ship (o200k_base for GPT-5 and GPT-4o, cl100k_base for GPT-4 and GPT-3.5, and the Llama 3 tokenizer for Llama 3 and 4). For Claude and Gemini the count is a labelled estimate, since those tokenizers are not publicly available to run in the browser. Estimates are typically within about 10% of the real value.
Why do different models report different token counts?
Each model family is trained with its own tokenizer and vocabulary, so the same sentence can split into a different number of tokens depending on the model. Newer vocabularies like OpenAI's o200k_base are generally more efficient, packing more characters into each token, which lowers the count compared to older tokenizers.
Is my text sent to a server?
No. All tokenization and counting happens locally in your browser using a tokenizer that loads on the page. Nothing you type or paste is uploaded, logged, or stored, which makes the tool safe to use for private prompts and confidential text.