LLM context window limits: maximum tokens for every major model in 2026

Context window size is the maximum number of tokens a model can process in a single request, including both input and output. In 2026, leading models range from 128K tokens (GPT-5.4 Mini) to 2 million tokens (Gemini 3.1 Pro and 3.5 Flash). Bigger isn't always better: long context windows cost more, run slower, and models can lose track of details in the middle of very long inputs.
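
Because the window covers both input and output, the room left for a reply is the window size minus the prompt length. The following is a minimal sketch of that budget check, assuming the token counts are already known. The function names and the example numbers are illustrative and are not taken from any provider's API; "128K" is assumed to mean 128,000 tokens.

```python
def remaining_output_tokens(context_window: int, prompt_tokens: int) -> int:
    """Tokens left for the model's reply after the prompt is counted."""
    if context_window <= 0 or prompt_tokens < 0:
        raise ValueError("context_window must be positive and prompt_tokens non-negative")
    # Never report a negative budget: an oversized prompt leaves zero room.
    return max(context_window - prompt_tokens, 0)


def fits(context_window: int, prompt_tokens: int, max_output_tokens: int) -> bool:
    """True if the prompt plus the reserved reply length fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


# Example with a 128K window (128_000 tokens is an assumption for "128K").
print(remaining_output_tokens(128_000, 120_000))  # 8000
print(fits(128_000, 120_000, 16_000))             # False
```

Some providers also cap output length separately from the context window; this sketch ignores that and only checks the shared budget.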

How this is calculated

GPT-5.4 (OpenAI): 256K tokens
GPT-5.4 Mini: 128K
GPT-5.5: 256K
Claude Opus 4.7 (Anthropic): 500K
Claude Sonnet 4.6: 200K
Gemini 3.1 Pro (Google): 2M
Gemini 3.5 Flash: 2M
Llama 4 (Meta): 128K (open-weight)
DeepSeek V4 Pro: 1M

The trend is upward. At roughly 0.75 English words per token, a 2M token context window can fit about 1.5 million words, roughly three times the length of the entire Lord of the Rings trilogy, in a single prompt. But effective context utilization drops as context length increases: the 'lost in the middle' problem means models often overlook content in the middle of long contexts. For most use cases, 128K tokens is more than enough.
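
The figures above can be put into a small lookup table to answer two practical questions: roughly how many words fit in a given model's window, and which models can hold a request of a given size. This is a sketch only. The limits are the ones listed in this article, "K" and "M" are assumed to mean 1,000 and 1,000,000, and the 0.75 words-per-token ratio is the rough English rule of thumb, not an exact conversion.

```python
# Context window sizes as listed in this article, in tokens.
# Assumption: "K" = 1,000 and "M" = 1,000,000.
CONTEXT_WINDOWS = {
    "GPT-5.4": 256_000,
    "GPT-5.4 Mini": 128_000,
    "GPT-5.5": 256_000,
    "Claude Opus 4.7": 500_000,
    "Claude Sonnet 4.6": 200_000,
    "Gemini 3.1 Pro": 2_000_000,
    "Gemini 3.5 Flash": 2_000_000,
    "Llama 4": 128_000,
    "DeepSeek V4 Pro": 1_000_000,
}

WORDS_PER_TOKEN = 0.75  # rough English rule of thumb: 100 tokens ~ 75 words


def approx_word_capacity(model: str) -> int:
    """Approximate number of English words that fit in the model's window."""
    return int(CONTEXT_WINDOWS[model] * WORDS_PER_TOKEN)


def models_that_fit(total_tokens: int) -> list[str]:
    """Models whose window can hold total_tokens (input plus output)."""
    return sorted(m for m, limit in CONTEXT_WINDOWS.items() if limit >= total_tokens)


print(approx_word_capacity("Gemini 3.1 Pro"))  # 1500000
print(models_that_fit(600_000))
# ['DeepSeek V4 Pro', 'Gemini 3.1 Pro', 'Gemini 3.5 Flash']
```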

Verdict

Context window size is a spec, not a guarantee of quality. A 2M token window lets you submit 2M tokens. Whether the model pays attention to all of them is a different question. For production applications, benchmark your specific task at multiple context lengths. You may find that 64K tokens performs better than 256K.
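
One way to benchmark at multiple context lengths is a simple "needle in the middle" test: bury one fact in the middle of filler text of increasing size and check whether the model still retrieves it. The sketch below shows the shape of such a harness. Everything in it is hypothetical: `ask_model` stands for whatever function calls your model, the lengths are counted in filler sentences rather than tokens, and `fake_model` is a stand-in that only reads the first 2,000 characters so that the example runs without any API.

```python
from typing import Callable, Dict, List


def build_prompt(needle: str, filler: str, n_filler: int) -> str:
    """Place the needle in the middle of n_filler filler sentences."""
    half = n_filler // 2
    parts = [filler] * half + [needle] + [filler] * (n_filler - half)
    return " ".join(parts)


def accuracy_by_length(
    ask_model: Callable[[str], str],
    lengths: List[int],
    needle: str,
    expected: str,
    filler: str = "The sky was grey that day.",
    trials: int = 3,
) -> Dict[int, float]:
    """Fraction of trials in which the reply contains the expected answer."""
    results = {}
    for n in lengths:
        prompt = build_prompt(needle, filler, n) + " What is the secret code?"
        hits = sum(expected in ask_model(prompt) for _ in range(trials))
        results[n] = hits / trials
    return results


# Stand-in model for demonstration only: it sees just the first 2,000 characters.
def fake_model(prompt: str) -> str:
    visible = prompt[:2000]
    return "7421" if "7421" in visible else "I don't know"


scores = accuracy_by_length(
    fake_model,
    lengths=[10, 100, 1000],
    needle="The secret code is 7421.",
    expected="7421",
)
print(scores)  # {10: 1.0, 100: 1.0, 1000: 0.0}
```

With a real model, replace `fake_model` with your own call, measure lengths in tokens, and use prompts drawn from your actual task, since retrieval of a planted fact is only a proxy for task quality.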

Frequently asked questions

What is a token in an LLM?
A token is a chunk of text that a language model reads as a single unit. It is usually a common word, part of a longer word, or a piece of punctuation rather than a whole word or a single character. As a rough rule of thumb, one token is about four characters of English text, and 100 tokens is roughly 75 words.
How accurate is this token counter?
For OpenAI and Llama models the count is exact, because it uses the same Byte Pair Encoding tokenizers those models ship (o200k_base for GPT-5 and GPT-4o, cl100k_base for GPT-4 and GPT-3.5, and the Llama 3 tokenizer for Llama 3 and 4). For Claude and Gemini the count is a labelled estimate, since those tokenizers are not publicly available to run in the browser. Estimates are typically within about 10% of the real value.
Why do different models report different token counts?
Each model family is trained with its own tokenizer and vocabulary, so the same sentence can split into a different number of tokens depending on the model. Newer vocabularies like OpenAI's o200k_base are generally more efficient, packing more characters into each token, which lowers the count compared to older tokenizers.
Is my text sent to a server?
No. All tokenization and counting happen locally in your browser using a tokenizer that loads on the page. Nothing you type or paste is uploaded, logged, or stored, which makes the tool safe to use for private prompts and confidential text.
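
The rule of thumb from the first answer (about four characters per token, or 100 tokens per 75 words) can be turned into a quick estimator. This is a rough sketch for English text only, not a real tokenizer: the function names are made up for illustration, and exact counts require running the model's own tokenizer.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough English token estimate: about 4 characters per token."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(text: str) -> int:
    """Alternative estimate: 100 tokens per 75 whitespace-separated words."""
    words = len(text.split())
    return math.ceil(words * 100 / 75)


sample = "Context window size is a spec, not a guarantee of quality."
print(estimate_tokens(sample), estimate_tokens_from_words(sample))
```

The two estimates usually land close together for ordinary prose and drift apart for code, URLs, or non-English text, which is where a real tokenizer matters most.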