What is a context window? How LLMs remember your conversation
An LLM's context window is its short-term memory. Every token in the current conversation (system prompt, user messages, assistant responses, tool calls) lives inside this window. When the window fills up, older tokens are dropped, and the model forgets them. This is why long conversations with an LLM sometimes lose track of earlier details.
How this is calculated
The context window includes both input and output tokens. If a model has a 128K context window and you send a 100K token document, you only have 28K tokens left for the model's response and any follow-up messages. The window is shared across the entire conversation. Techniques for managing context: summarize older messages when approaching the limit, use vector search (RAG) to inject only relevant information rather than dumping entire documents, and use the model's built-in prompt caching for content that repeats across messages. Some models also support context window extension via techniques like RoPE scaling, but this usually comes with a quality trade-off.
Verdict
The context window is the LLM's working memory. Respect its limits. For long documents, use RAG. For long conversations, summarize. For structured workflows, reset the context between independent tasks.
More Tokens scenarios
Related guides
Frequently asked questions
What is a token in an LLM?
How accurate is this token counter?
Why do different models report different token counts?
Is my text sent to a server?
Related tools
LLM API Pricing Calculator
Compare API costs across major models (OpenAI, Anthropic, Google) with prompt caching.
Use tool ➜LLM VRAM Calculator
Calculate the VRAM needed to run or fine-tune any LLM at any quantization.
Use tool ➜JSON Formatter
Validate, format, and minify JSON data with syntax highlighting.
Use tool ➜