Concept dossier

Why your chatbot starts to forget.

Every model has a whiteboard. Every message you send goes on it. The longer the conversation, the more there is to re-read — and the slower, more forgetful, and more expensive the model becomes. Welcome to the context window.

Made by: Dr Arpit Garg

An interactive deep dive · ~9 min read

For: Anyone who's noticed

Format: Scroll, drag, type

Chapter 01 · The whiteboard

Imagine the model has a whiteboard.

Every time you send a message, the model writes everything down — your message, the system instructions, the entire history of your conversation, any documents you've attached. Then, before answering, it re-reads the whole whiteboard. That whiteboard is the context window. It has a finite size, measured in tokens.²

[Key figures: words per token (English average); tokens in Claude's standard window; Gemini 2.5's window, marketing maximum.⁹]

A context window is the total amount of text the model processes in a single turn. Everything the model "sees" must fit inside it: the system prompt, your chat history, any documents you've attached, tool outputs, the current message, and the space reserved for the answer. Tokens beyond the limit are simply cut off. The model cannot read them.¹⁰
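The budget described above is simple bookkeeping, and a minimal sketch makes it concrete. Nothing here is a vendor API: the function name `fits_in_window`, the part names, and every token count are hypothetical values chosen for illustration.

```python
def fits_in_window(parts: dict[str, int], window: int, reserved_for_answer: int) -> bool:
    """Return True if all input parts plus the reserved answer space fit in the window."""
    return sum(parts.values()) + reserved_for_answer <= window


# Hypothetical token counts for a single turn (7,500 tokens of input in total).
turn = {
    "system_prompt": 500,
    "chat_history": 5_000,
    "attached_docs": 1_500,
    "tool_outputs": 300,
    "current_message": 200,
}

# With 1,000 tokens reserved for the answer, the turn overflows an 8K window.
print(fits_in_window(turn, window=8_000, reserved_for_answer=1_000))  # False
# With only 400 tokens reserved, it just fits.
print(fits_in_window(turn, window=8_000, reserved_for_answer=400))    # True
```

The point of the sketch is that the answer competes for the same space as the input: reserving more room for output leaves less room for history.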

Think of it as the model's working memory. Within a single session, it works well — the model reasons over everything you've given it, refers back to earlier turns, and stays coherent across a long exchange. The trouble is not what happens within a single turn. The trouble is what happens as the window fills.

[Interactive demo: Fill the whiteboard. Each sent message is added to an 8K-token working memory, while a telemetry panel tracks tokens used, percent full, response latency, cost so far, and memory health.]

"You've seen this with chatbots that start fast but get sluggish as conversations lengthen."
Redis Engineering, on context windows⁶

Chapter 02 · The math of slow

Double the chat. Quadruple the work.

Here's the brutal arithmetic. To respond, the model has to compare every token with every other token in the window. Ten tokens = a hundred comparisons. A hundred tokens = ten thousand. The mechanism that makes the model smart — attention — also makes it scale terribly.³

[Interactive demo: Drag to add tokens. The number of tokens in the window grows linearly, O(n), while the number of comparisons grows quadratically, O(n²).]

A real conversation has tens of thousands of tokens, not a few dozen. At 30,000 tokens that's 900 million comparisons. Per response. The bigger the whiteboard, the harder it gets to re-read.⁸
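The arithmetic in this chapter is easy to check. The sketch below uses the same simplification as the text (every token compared with every token, so n tokens means n × n comparisons); the function name `pairwise_comparisons` is made up for illustration.

```python
def pairwise_comparisons(n_tokens: int) -> int:
    """Every token compared with every token in the window: n squared."""
    return n_tokens * n_tokens


for n in (10, 100, 30_000):
    print(f"{n:>6,} tokens -> {pairwise_comparisons(n):,} comparisons")

# Doubling the window quadruples the work.
print(pairwise_comparisons(200) // pairwise_comparisons(100))  # 4
```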

It gets worse on the hardware side. The model stores its work-so-far in a structure called the KV cache. Each new token adds to it. The cache grows. GPU memory eventually saturates. As Redis Engineering describes it: the GPU has fast small memory and slow large memory, and attention constantly shuffles data between them. Memory bandwidth becomes the bottleneck. That's the lag you feel.⁶
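To get a feel for how fast the KV cache grows, here is a common back-of-the-envelope estimate for a standard transformer: one key vector and one value vector per token, per layer, per key/value head. The formula is a rough sketch that ignores implementation details, and the model dimensions below are hypothetical, not those of any specific model.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumes 16-bit values
) -> int:
    """Rough KV cache size: a key and a value vector per token, layer, and head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Hypothetical model: 32 layers, 8 key/value heads, 128 dimensions per head.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

total = kv_cache_bytes(30_000, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{total / 2**30:.2f} GiB for a 30,000-token conversation")
```

Under these assumptions the cache grows linearly with the conversation, which is why a long chat eventually presses against GPU memory even though each new token adds only a small amount.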

Chapter 03 · Tokens

The model doesn't see words. It sees pieces.

Before any text reaches the model, it's chopped into tokens: small chunks roughly three-quarters of a word long. Common words become a single token. Rare ones split into pieces. Punctuation counts. Numbers can split in unpredictable ways. This is why "antidisestablishmentarianism" costs more than "the cat."

[Interactive demo: Tokenizer simulator. Type text to see an approximate breakdown into characters, words, and tokens.]

A rough rule: one token ≈ 0.75 words in English. So 1,000 tokens is around 750 words — about three pages of paperback novel. A 200K-token window is around 150,000 words, or roughly a Harry Potter book. A 2M-token window is more than the entire Lord of the Rings trilogy. But — and this is the next chapter — being able to hold that much isn't the same as being able to use it.⁴
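The rough rule above translates directly into a quick estimator. This is only the 0.75-words-per-token heuristic from the text, not a real tokenizer: it counts whitespace-separated words, so it will not show a long rare word splitting into many pieces. The function names are hypothetical.

```python
WORDS_PER_TOKEN = 0.75  # rough English average quoted in the text


def estimate_tokens(text: str) -> int:
    """Estimate token count from a whitespace word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def words_that_fit(window_tokens: int) -> int:
    """Estimate how many English words fit in a window of the given size."""
    return int(window_tokens * WORDS_PER_TOKEN)


print(estimate_tokens("the cat sat on the mat"))  # 8
print(words_that_fit(1_000))                      # 750
print(words_that_fit(200_000))                    # 150000
```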

Chapter 04 · Lost in the middle

It remembers the start. It remembers the end. Not the middle.

In 2024, Stanford and Berkeley researchers published a paper that should have ended the context-size arms race. They showed that even when relevant information technically fits inside the window, models lose track of it when it's buried in the middle. They called it lost in the middle.⁹

[Interactive demo: The U-curve, lived. A slider places a key fact (the needle) at different positions in a long context, from start through middle to end, and reports retrieval accuracy at that position.]

The takeaway: a 128K context window is not 128K tokens of uniform, reliable attention. The model has a strong recency bias (it remembers the most recent turn well) and a primacy bias (it remembers what came first, like the system prompt). The middle blurs. This is true even of frontier models — newer architectures mitigate it but haven't erased it.¹⁵

One 2025 study isolated the effect cleanly: padding a prompt with 30,000 irrelevant tokens caused one tested model to drop 67.6 points on MMLU, a widely used knowledge and reasoning benchmark, purely from the input being longer. Not harder. Just longer.¹¹

"You remember how conversations start and how they end. The middle part? Fuzzy."
Aditya Kamat, on why models behave the same way¹⁵

Chapter 05 · The bill

Every word you send, you pay for.

Providers bill per token, both in and out. Unless some form of prompt caching applies, you are not charged less when the model re-reads tokens it has already seen: every turn, every prior message gets billed again. Long conversations compound. Drag the dials to see what happens.

[Interactive demo: Your conversation. Dials set the number of messages exchanged, the average message length in words, the pages of attached documents, and the model price in $ per million tokens. The readout shows the total cost of the conversation, total tokens, latency per reply, and window status.]

To put scale on it: a single request that fills a 1 million-token window costs around $2 in input tokens alone, before any answer is generated. Cost scales linearly with token count even when the window is generous. Headroom is not free.¹⁶
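The compounding is worth seeing in numbers. The sketch below assumes the simplest possible billing model: every turn resends the whole history, every message is the same length, there is no caching, and only input tokens are counted. The function names, message sizes, and prices are illustrative assumptions, not any provider's actual pricing.

```python
def billed_input_tokens(n_messages: int, tokens_per_message: int) -> int:
    """Input tokens billed when each turn resends the entire history so far."""
    # Turn k carries k messages' worth of tokens: 1 + 2 + ... + n messages.
    return tokens_per_message * n_messages * (n_messages + 1) // 2


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of a token count at a given price in dollars per million tokens."""
    return tokens / 1_000_000 * price_per_million


# 20 messages of about 100 tokens each contain only 2,000 tokens of actual text...
print(20 * 100)                      # 2000
# ...but re-reading the history every turn bills more than ten times that.
print(billed_input_tokens(20, 100))  # 21000

# A single request that fills a 1M-token window, at an assumed $2 per million tokens.
print(cost_usd(1_000_000, 2.0))      # 2.0
```

The billed total grows with the square of the number of messages, which is why trimming history tends to save more than trimming any single message.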

Chapter 06 · The fixes

Good systems don't fight the window. They manage it.

Bigger windows aren't the answer. The math is what it is. Production systems work around the problem with four techniques. Each trades something away.

Technique 01

Compaction

When old turns aren't being referenced, summarize them. Replace 20 messages of back-and-forth with one sentence of "the user wanted X, we agreed Y." Saves tokens; loses some nuance.

Used by: ChatGPT, Claude long chats
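A minimal sketch of compaction, assuming the history is a plain list of strings. The `summarize` argument is a hypothetical stand-in: in a real system it would typically be a call to a model, whereas here it is a placeholder so the example runs on its own. This is an illustration of the idea, not how any named product implements it.

```python
from typing import Callable


def compact(
    history: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace everything except the most recent turns with one summary entry."""
    if len(history) <= keep_recent:
        return list(history)
    split = len(history) - keep_recent
    old, recent = history[:split], history[split:]
    return [summarize(old)] + recent


def placeholder_summary(messages: list[str]) -> str:
    # Hypothetical summarizer; a real one would condense the content itself.
    return f"[summary of {len(messages)} earlier messages]"


history = [f"message {i}" for i in range(1, 23)]  # 22 messages
compacted = compact(history, keep_recent=2, summarize=placeholder_summary)
print(compacted)
# ['[summary of 20 earlier messages]', 'message 21', 'message 22']
```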

Technique 02

Retrieval (RAG)

Keep your knowledge base outside the window. When the user asks something, look up only the most relevant snippets and inject those. The window stays small. The library can be vast.¹⁴

Used by: Most enterprise AI
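A toy version of retrieval shows the shape of the technique. Production systems usually rank snippets with embeddings and a vector index; the word-overlap score below is a deliberately crude substitute so the sketch stays self-contained. The library contents, function names, and prompt layout are all made up for illustration.

```python
def overlap_score(query: str, snippet: str) -> int:
    """Count distinct lowercase words shared by the query and the snippet."""
    return len(set(query.lower().split()) & set(snippet.lower().split()))


def retrieve(query: str, library: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k snippets with the highest overlap score."""
    ranked = sorted(library, key=lambda s: overlap_score(query, s), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, library: list[str], top_k: int = 2) -> str:
    """Inject only the retrieved snippets into the prompt, not the whole library."""
    context = "\n".join(retrieve(query, library, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base kept outside the window.
library = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes five business days.",
]

print(build_prompt("how long do refunds take", library, top_k=1))
```

However large the library grows, the prompt only ever contains `top_k` snippets, which is the whole trade: a small window in exchange for depending on the quality of the lookup.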

Technique 03

Sliding window

Keep a rolling look-back of the last N turns; drop everything older. Like a goldfish with a usefully sized memory. Cheap, fast, but the model forgets — by design.²

Used by: lightweight chatbots
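A sliding window is short enough to sketch in full. This version counts turns rather than tokens, which is an assumption made for simplicity; the class name `SlidingWindow` is hypothetical.

```python
from collections import deque


class SlidingWindow:
    """Keep only the last max_turns turns; older ones are dropped automatically."""

    def __init__(self, max_turns: int) -> None:
        self.turns: deque[str] = deque(maxlen=max_turns)

    def add(self, turn: str) -> None:
        self.turns.append(turn)

    def context(self) -> list[str]:
        return list(self.turns)


window = SlidingWindow(max_turns=3)
for i in range(1, 6):
    window.add(f"turn {i}")

print(window.context())  # ['turn 3', 'turn 4', 'turn 5']
```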

Technique 04

Dual-tier memory

Two memories. Working memory holds the current session. Long-term memory holds extracted facts the system pulls out over time. The Mem0 benchmark shows this cuts token usage by roughly 90% versus stuffing everything in.¹⁰

Used by: agent frameworks
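The two-tier idea can be sketched with a bounded queue for the session and a dictionary for durable facts. This is an illustrative toy and not Mem0's API or any framework's implementation: the class and method names are hypothetical, and fact extraction, which real systems typically delegate to a model, is left to the caller here.

```python
from collections import deque


class DualTierMemory:
    """Working memory for recent turns plus a long-term store of extracted facts."""

    def __init__(self, working_size: int) -> None:
        self.working: deque[str] = deque(maxlen=working_size)
        self.long_term: dict[str, str] = {}

    def add_turn(self, turn: str) -> None:
        self.working.append(turn)

    def remember_fact(self, key: str, value: str) -> None:
        # In this sketch the caller decides what counts as a fact worth keeping.
        self.long_term[key] = value

    def context(self) -> list[str]:
        """Facts first, then the recent turns: what would be sent to the model."""
        facts = [f"{key}: {value}" for key, value in self.long_term.items()]
        return facts + list(self.working)


memory = DualTierMemory(working_size=2)
memory.remember_fact("name", "Sam")
for turn in ("hello", "I prefer short answers", "what is a token?"):
    memory.add_turn(turn)

print(memory.context())
# ['name: Sam', 'I prefer short answers', 'what is a token?']
```

The first turn has fallen out of working memory, but the extracted fact survives, so the context stays small without losing what matters.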

Chapter 07 · The practical bit

Three things you can do, starting today.

i.

Start fresh when the topic shifts.

If you've been chatting about a recipe and now want help debugging code, open a new conversation. The model isn't carrying a soul across the chat — it's re-reading the whiteboard. A whiteboard about pasta makes coding answers worse, slower, and more expensive.

ii.

Put the important stuff up front.

Because of the U-curve, the model attends best to the beginning and end of the context. Start your message with the key constraint or the most important document. Don't bury the lead in turn fourteen.¹⁴

iii.

Don't trust the marketing maximums.

A "1 million token context" is what the model accepts. Practical effective length is much lower — most long-context models show sharp drops past 32K. The RULER benchmark found effective length is often a fraction of what's advertised.⁶
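The second tip can be applied mechanically when you control how several documents are ordered in a prompt. One possible approach, sketched below, is to alternate items between the front and the back so the least important material ends up in the middle, where the U-curve says attention is weakest. The function name is hypothetical and the approach is an assumption built on the tip, not a method taken from the cited sources.

```python
def place_at_edges(ranked: list[str]) -> list[str]:
    """Reorder items so importance is highest at both ends.

    ranked[0] is the most important item, ranked[-1] the least.
    """
    front: list[str] = []
    back: list[str] = []
    for i, item in enumerate(ranked):
        # Even ranks go to the front, odd ranks to the back.
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


# "A" is the most important document, "E" the least.
print(place_at_edges(["A", "B", "C", "D", "E"]))
# ['A', 'C', 'E', 'D', 'B']
```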

"A 128K context window is not 128K tokens of uniform, reliable attention."
Mem0, on the limits of long context¹⁰

The shorthand for everything above: the model is not your friend with a perfect memory. It's a function with a whiteboard. The whiteboard fills up. When it does, the model gets slower, more expensive, and worse at retrieval — even when the answer is right there in front of it. Good engineering doesn't beg for a bigger whiteboard. It keeps the whiteboard tidy.

References

Sources, all open.

The field is moving fast. Every link goes to the original; most were published in 2025 or 2026.

1. Kiruluta, Raju, Burity. Breaking Quadratic Barriers. arXiv:2506.01963, UC Berkeley, Sept 2025.
2. Atlan. LLM Context Window Limitations in 2026. Atlan blog, April 2026.
3. AI Agent Memory. LLM Context Window Attention. aiagentmemory.org, April 2026.
4. Tahir. Understanding LLM Context Windows: Tokens, Attention, Challenges. Medium, February 2025.
5. Unstructured. LLM Context Windows Explained. unstructured.io developer guide, 2025.
6. Redis. LLM context windows: what they are and how they work. Redis blog, January 2026.
7. Demiliani, S. Understanding LLM performance degradation. demiliani.com, November 2025.
8. Bahgat, O. Context Windows Explained: The Math, Limits, Future. Substack, 2025.
9. Reliable Data Engineering. Stop Chasing Million-Token Context Windows. Medium, December 2025.
10. Mem0. Context Window vs Persistent Memory: Why 1M Tokens Isn't Enough. mem0.ai blog, April 2026.
11. Redis. Context Pruning: Cut LLM Tokens Without Losing Quality. Redis blog, 2026.
12. Fast.io. How to Manage LLM Context Windows Effectively. fast.io resources, February 2026.
13. Hakia. Context Windows Explained: Why Token Limits Matter. hakia.com, January 2026.
14. Maxim. Context Window Management Strategies for AI Agents. getmaxim.ai articles, 2026.
15. Kamat, A. Why 400k tokens doesn't mean what you think. Medium, October 2025.
16. Wikipedia. Context window. en.wikipedia.org, ongoing.
17. Elvex. Context Length Comparison: Leading AI Models in 2026. elvex.com blog, March 2026.
18. DevTk.AI. LLM Context Windows Explained: 4K to 1M Tokens. devtk.ai blog, March 2026.