Every model has a whiteboard. Every message you send goes on it. The longer the conversation, the more there is to re-read — and the slower, more forgetful, and more expensive the model becomes. Welcome to the context window.
Every time you send a message, the model writes everything down: your message, the system instructions, the entire history of your conversation, any documents you've attached. Then, before answering, it re-reads the whole whiteboard. That whiteboard is the context window. It has a finite size, measured in tokens. [2]
A context window is the total amount of text the model processes in a single turn. Everything the model "sees" must fit inside it: the system prompt, your chat history, any documents you've attached, tool outputs, the current message, and the space reserved for the answer. Tokens beyond the limit are either cut off or cause the request to be rejected, depending on the system. Either way, the model cannot read them. [10]
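The budget described above is plain addition. Here is a minimal sketch, assuming made-up token counts and a hypothetical 128,000-token window; the function name `remaining_tokens` and every number in `request` are illustrative, not taken from any particular model or API.

```python
def remaining_tokens(window: int, reserved_for_answer: int, parts: dict[str, int]) -> int:
    """Tokens left once every input part and the reserved answer space are counted.

    A negative result means the request does not fit in the window.
    """
    return window - reserved_for_answer - sum(parts.values())


# Hypothetical token counts for one request (assumed for illustration).
request = {
    "system_prompt": 1_500,
    "chat_history": 90_000,
    "documents": 30_000,
    "tool_outputs": 4_000,
    "current_message": 500,
}

left = remaining_tokens(window=128_000, reserved_for_answer=4_000, parts=request)
print(left)  # -2000: this request is 2,000 tokens over budget
```

Note that the chat history, not the current message, is what pushes this example over the limit.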
Think of it as the model's working memory. Within a single session, it works well — the model reasons over everything you've given it, refers back to earlier turns, and stays coherent across a long exchange. The trouble is not what happens within a single turn. The trouble is what happens as the window fills.
You've seen this with chatbots that start fast but get sluggish as conversations lengthen. (Redis Engineering on context windows [6])
Here's the brutal arithmetic. To respond, the model has to compare every token with every other token in the window. Ten tokens = a hundred comparisons. A hundred tokens = ten thousand. The mechanism that makes the model smart, attention, also makes it scale terribly. [3]
A real conversation runs to tens of thousands of tokens, not a few dozen. At 30,000 tokens that's 900 million comparisons. Per response. The bigger the whiteboard, the harder it gets to re-read. [8]
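The arithmetic above can be checked in a few lines. This is a minimal sketch of the naive count the text uses (every token compared with every token, so n squared). The function name `pairwise_comparisons` is made up for illustration. Real models repeat this work in every layer and attention head, and causal masking roughly halves the count, so treat the numbers as a scaling illustration rather than a measurement.

```python
def pairwise_comparisons(n_tokens: int) -> int:
    """Naive attention cost: each token is compared with every token in the window."""
    return n_tokens * n_tokens


for n in (10, 100, 30_000):
    print(f"{n:>6} tokens -> {pairwise_comparisons(n):,} comparisons")
# 10 tokens -> 100, 100 tokens -> 10,000, 30,000 tokens -> 900,000,000
```

Multiplying the input by 10 multiplies the work by 100, which is the whole problem in one line.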
It gets worse on the hardware side. The model stores its work-so-far in a structure called the KV cache. Each new token adds to it. The cache grows. GPU memory eventually saturates. As Redis Engineering describes it: the GPU has fast small memory and slow large memory, and attention constantly shuffles data between them. Memory bandwidth becomes the bottleneck. That's the lag you feel. [6]
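To get a feel for how fast the cache grows, here is a rough estimate. It is a sketch under stated assumptions: the model shape (32 layers, 8 key/value heads, head dimension 128, 16-bit values) is hypothetical and chosen only for illustration, and the formula is the usual back-of-envelope one, storing one key and one value vector per token, per layer, per KV head.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 16-bit floats (assumed)
) -> int:
    """Approximate KV cache size: keys and values for every token in every layer."""
    keys_and_values = 2
    return keys_and_values * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Hypothetical model shape, not any specific released model.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **shape)
at_30k = kv_cache_bytes(30_000, **shape)
print(per_token)                 # 131072 bytes, i.e. 128 KiB per token
print(f"{at_30k / 2**30:.2f}")   # 3.66 GiB for a 30,000-token context
```

The cache grows linearly with the conversation, and under these assumptions a single long chat occupies several gigabytes of GPU memory on its own.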
Before any text reaches the model, it's chopped into tokens: small chunks that average roughly three-quarters of a word. Common words become a single token. Rare ones split into pieces. Punctuation counts. Numbers are weird: a long number can split into several tokens. This is why "antidisestablishmentarianism" costs more than "the cat."
A rough rule: one token ≈ 0.75 words in English. So 1,000 tokens is around 750 words, about three pages of a paperback novel. A 200K-token window is around 150,000 words, or roughly a Harry Potter book. A 2M-token window is more than the entire Lord of the Rings trilogy. But, and this is the next chapter, being able to hold that much isn't the same as being able to use it. [4]
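The rule of thumb translates directly into code. This is a sketch of the 0.75-words-per-token heuristic only, not a real tokenizer: actual counts depend on the model's tokenizer and on the text, and the function names here are invented for illustration.

```python
WORDS_PER_TOKEN = 0.75  # rough English average, per the rule of thumb above


def estimate_tokens(text: str) -> int:
    """Estimate token count from a whitespace word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def words_that_fit(window_tokens: int) -> int:
    """Estimate how many English words fit in a window of the given size."""
    return round(window_tokens * WORDS_PER_TOKEN)


print(estimate_tokens("the cat sat"))  # 4
print(words_that_fit(1_000))           # 750
print(words_that_fit(200_000))         # 150000
print(words_that_fit(2_000_000))       # 1500000
```

For anything that affects billing or truncation, count with the tokenizer your model actually uses rather than this estimate.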
In 2024, Stanford and Berkeley researchers published a paper that should have ended the context-size arms race. They showed that even when relevant information technically fits inside the window, models lose track of it when it's buried in the middle. They called it "lost in the middle." [9]
The takeaway: a 128K context window is not 128K tokens of uniform, reliable attention. The model has a strong recency bias (it remembers the most recent turn well) and a primacy bias (it remembers what came first, like the system prompt). The middle blurs. This is true even of frontier models; newer architectures mitigate it but haven't erased it. [15]
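One way to probe this effect yourself is to plant a single fact at different depths in a long prompt and ask the model to retrieve it. The sketch below builds only the prompts; it is not the protocol from the paper, the helper `place_fact` and the filler text are made up for illustration, and the model call is left out because it depends on which API you use.

```python
def place_fact(filler: list[str], fact: str, position: float) -> str:
    """Insert `fact` among filler paragraphs at a relative position.

    position 0.0 puts the fact first, 1.0 puts it last, 0.5 in the middle.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    index = round(position * len(filler))
    return "\n\n".join(filler[:index] + [fact] + filler[index:])


# Hypothetical filler and fact, for illustration only.
filler = [f"Filler paragraph {i}." for i in range(10)]
fact = "The access code is 4417."

for pos in (0.0, 0.5, 1.0):
    prompt = place_fact(filler, fact, pos)
    paragraphs = prompt.split("\n\n")
    print(pos, paragraphs.index(fact))
# 0.0 -> paragraph 0, 0.5 -> paragraph 5, 1.0 -> paragraph 10
```

Sending each prompt with the same question ("What is the access code?") and plotting accuracy against position is what produces the U-shaped curve.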
One 2025 study isolated the effect cleanly: padding a prompt with 30,000 irrelevant tokens caused one tested model to drop 67.6 points on MMLU, a major reasoning benchmark, purely from the input being longer. Not harder. Just longer. [11]
You remember how conversations start and how they end. The middle part? Fuzzy. (Aditya Kamat, on why models behave the same way [15])
Providers bill per token, both in and out. Unless prompt caching applies, you are not charged less when the model re-reads tokens it has already seen: every turn, every prior message gets billed again. Long conversations compound.
To put scale on it: a single request that fills a 1 million-token window costs around $2 in input tokens alone, before any answer is generated. Cost scales linearly with token count even when the window is generous. Headroom is not free. [16]
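The compounding is easy to underestimate, so here is a small sketch of it. The assumptions are loud ones: each turn adds a fixed number of tokens to the history, the whole history is resent as input on every turn, there is no prompt caching, output tokens are ignored, and the price of $2 per million input tokens is simply the rate implied by the figure above. The function names are invented for illustration.

```python
def conversation_input_tokens(turn_sizes: list[int]) -> int:
    """Total input tokens billed when every turn resends the full history so far."""
    total_billed = 0
    history = 0
    for size in turn_sizes:
        history += size          # the whiteboard grows by this turn's tokens
        total_billed += history  # and the whole whiteboard is billed again
    return total_billed


def input_cost_usd(tokens: int, usd_per_million: float = 2.0) -> float:
    """Input cost at an assumed flat rate (default taken from the $2 per 1M figure)."""
    return tokens * usd_per_million / 1_000_000


turns = [500] * 20  # 20 turns of 500 tokens each: 10,000 tokens of actual conversation
billed = conversation_input_tokens(turns)
print(billed)                           # 105000
print(f"${input_cost_usd(billed):.2f}")  # $0.21
```

A conversation containing 10,000 tokens ends up billing 105,000 input tokens, more than ten times its own length, and the ratio keeps growing with every turn.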
Bigger windows aren't the answer. The math is what it is. Production systems work around the problem with four techniques. Each trades something away.
If you've been chatting about a recipe and now want help debugging code, open a new conversation. The model isn't carrying a soul across the chat — it's re-reading the whiteboard. A whiteboard about pasta makes coding answers worse, slower, and more expensive.
Because of the U-curve, the model attends best to the beginning and end of the context. Start your message with the key constraint or the most important document. Don't bury the lead in turn fourteen. [14]
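In code, this advice is just a matter of ordering the pieces before you join them. The sketch below is one possible layout, assuming a hypothetical `build_prompt` helper: the key constraint goes first, supporting documents sit in the middle, and the question goes last, so the two things that matter most occupy the two positions the model attends to best.

```python
def build_prompt(key_constraint: str, documents: list[str], question: str) -> str:
    """Assemble a prompt with the constraint at the start and the question at the end."""
    sections = [key_constraint, *documents, question]
    return "\n\n".join(sections)


# Hypothetical content, for illustration only.
prompt = build_prompt(
    key_constraint="Constraint: the fix must run on Python 3.8.",
    documents=["<contents of app.py>", "<contents of the error log>"],
    question="Why does the upload handler crash on large files?",
)
print(prompt.splitlines()[0])   # Constraint: the fix must run on Python 3.8.
print(prompt.splitlines()[-1])  # Why does the upload handler crash on large files?
```

If several documents compete for attention, putting the most important one directly after the constraint follows the same logic.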
A "1 million token context" is what the model accepts. Practical effective length is much lower: most long-context models show sharp drops past 32K. The RULER benchmark found effective length is often a fraction of what's advertised. [6]
A 128K context window is not 128K tokens of uniform, reliable attention. (Mem0 on the limits of long context [10])
The shorthand for everything above: the model is not your friend with a perfect memory. It's a function with a whiteboard. The whiteboard fills up. When it does, the model gets slower, more expensive, and worse at retrieval — even when the answer is right there in front of it. Good engineering doesn't beg for a bigger whiteboard. It keeps the whiteboard tidy.
The field is moving fast. Every link goes to the original; most were published in 2025 or 2026.