What is a context window? How LLMs remember your conversation

An LLM's context window is its short-term memory. Every token in the current conversation (system prompt, user messages, assistant responses, tool calls) lives inside this window. When the window fills up, older tokens are dropped, and the model forgets them. This is why long conversations with an LLM sometimes lose track of earlier details.

Knowledge area
Fundamentals
How tokens and tokenization work
Topic focus
Context window explained
context-window

How this is calculated

The context window includes both input and output tokens. If a model has a 128K context window and you send a 100K token document, you only have 28K tokens left for the model's response and any follow-up messages. The window is shared across the entire conversation. Techniques for managing context: summarize older messages when approaching the limit, use vector search (RAG) to inject only relevant information rather than dumping entire documents, and use the model's built-in prompt caching for content that repeats across messages. Some models also support context window extension via techniques like RoPE scaling, but this usually comes with a quality trade-off.

Verdict

The context window is the LLM's working memory. Respect its limits. For long documents, use RAG. For long conversations, summarize. For structured workflows, reset the context between independent tasks.

More Tokens scenarios

Frequently asked questions

What is a token in an LLM?
A token is a chunk of text that a language model reads as a single unit. It is usually a common word, part of a longer word, or a piece of punctuation rather than a whole word or a single character. As a rough rule of thumb, one token is about four characters of English text, and 100 tokens is roughly 75 words.
How accurate is this token counter?
For OpenAI and Llama models the count is exact, because it uses the same Byte Pair Encoding tokenizers those models ship (o200k_base for GPT-5 and GPT-4o, cl100k_base for GPT-4 and GPT-3.5, and the Llama 3 tokenizer for Llama 3 and 4). For Claude and Gemini the count is a labelled estimate, since those tokenizers are not publicly available to run in the browser. Estimates are typically within about 10% of the real value.
Why do different models report different token counts?
Each model family is trained with its own tokenizer and vocabulary, so the same sentence can split into a different number of tokens depending on the model. Newer vocabularies like OpenAI's o200k_base are generally more efficient, packing more characters into each token, which lowers the count compared to older tokenizers.
Is my text sent to a server?
No. All tokenization and counting happens locally in your browser using a tokenizer that loads on the page. Nothing you type or paste is uploaded, logged, or stored, which makes the tool safe to use for private prompts and confidential text.