---
title: "Tokens and Context Windows: Why AI Forgets Things"
description: What tokens really are, why context windows limit AI memory, and why your AI assistant loses track of conversations. A look at the mechanics behind AI forgetting.
date: February 5, 2026
author: Robert Soares
category: ai-fundamentals
---

Somewhere around message fifteen, the AI assistant stops recognizing your project. You remind it. It apologizes. Two messages later, it forgets again.

This is not a bug. This is architecture.

## Words Are Not What AI Reads

Open a book. You see words. An AI model sees something different: chunks called tokens.

A token might be a complete word, but often it is not. The word "hamburger" splits into three pieces. "Ham." "Bur." "Ger." Each piece is a separate token that the model processes independently, even though your brain sees one word.

This splitting happens through a process called tokenization, and different models use different approaches. The specific algorithm matters more than most people realize. As Simon Willison observed in his analysis of GPT tokenizers: "Many of the quirks and limitations of LLMs can be traced back to details of the tokenizer used."

Common words survive intact. Rare words get sliced up. Technical jargon, names, non-English text? Chopped into fragments.

Here is where it gets interesting. The word "Tokenization" itself splits into two tokens: token 30,642 and token 1,634. The AI does not see it as one unit. It sees two pieces that learned to go together during training, the same way you learned that "ham," "bur," and "ger" spell a sandwich.

Languages matter too. English tokenizes efficiently because these systems were trained primarily on English text. Spanish, Chinese, Arabic? They all produce more tokens per word. The phrase "Cómo estás" uses 5 tokens for just 10 characters, which means non-English speakers hit limits faster while saying less.

## The Tokenizer's Weird Memory

Tokenizers remember things from their training data in strange ways. Willison noted an interesting bias: "The English bias is obvious here. ' man' gets a lower token ID of 582, because it's an English word." Lower token IDs generally correspond to more common tokens. The model essentially has favorites.

Then there are glitch tokens. During tokenizer training, certain patterns appeared so frequently they became their own tokens, even when they should not have been. One example is " davidjl" (with the leading space), which became its own token because that username appeared hundreds of thousands of times in the data used to build the tokenizer.

The old tokenizer encoded "SolidGoldMagikarp" as a single token because of similar statistical accidents. The new tokenizer breaks it into five: "Solid," "Gold," "Mag," "ik," "arp."

These are not just curiosities. They reveal that tokenization is not a neutral translation layer. It carries the biases and accidents of its training data into every conversation you have with an AI.

## Context Windows: The Invisible Walls

Every AI model has a context window. This is the maximum number of tokens it can hold at once. Your messages, the AI's responses, any documents you upload, the system prompt running in the background: all of it must fit inside this window.

The numbers have grown dramatically. GPT-4 Turbo offers 128,000 tokens. Claude gives you 200,000 tokens. Gemini 2.5 Pro pushes to 1 million. Meta's Llama 4 Scout claims 10 million.

A million tokens sounds like infinity. It is roughly 750,000 words. Several novels. An entire codebase.
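Both halves of this picture, the splitting and the budget, are easy to check for yourself. The sketch below uses OpenAI's open-source tiktoken library with its cl100k_base encoding; which pieces and IDs you get depends entirely on the tokenizer you load, and the 128,000-token budget is just the GPT-4 Turbo figure quoted above. The sample conversation is invented for illustration.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by the GPT-3.5/GPT-4 family of models.
enc = tiktoken.get_encoding("cl100k_base")

# How the splitting looks in practice. Exact pieces vary by tokenizer, and
# non-English text may split mid-character (hence errors="replace").
for text in ["hamburger", "Tokenization", "Cómo estás"]:
    ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(i).decode("utf-8", errors="replace")
              for i in ids]
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")

# A context window is just a token budget that everything must share:
# system prompt, your messages, the model's replies, uploaded documents.
CONTEXT_WINDOW = 128_000  # the GPT-4 Turbo figure mentioned above

conversation = [
    "You are a helpful assistant.",                  # system prompt
    "Here is my project brief: build a todo app.",   # your message
    "Understood. Let's start with the data model.",  # the model's reply
]
used = sum(len(enc.encode(message)) for message in conversation)
print(f"{used:,} of {CONTEXT_WINDOW:,} tokens used ({used / CONTEXT_WINDOW:.3%})")
```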
So why does your AI forget what you told it twenty minutes ago?

## Bigger Windows, Same Problems

On Hacker News, a user named jokethrowaway cut to the core issue: "Context window size is not the limiting factor. How well will it be able to use that information is the problem." Having space is not the same as using space well.

Research from Stanford demonstrated what they called the "lost in the middle" effect. AI models show a U-shaped attention curve. They attend well to information at the beginning of the context. They attend well to information at the end. The middle? It fades.

In experiments, GPT-3.5-Turbo's performance dropped more than 20% when key information was placed in the middle of the input rather than at the beginning or end. Sometimes the model performed worse than if it had no context at all. Having the information and using it are different things.

This is not a software bug that will be patched next Tuesday. It emerges from the attention mechanism that makes transformers work in the first place, the mathematical process that allows the model to understand which parts of the input relate to which other parts. That mechanism naturally favors certain positions. The architecture has opinions about what matters.

## The Gap Between Claims and Reality

Research from Chroma examined what happens as models approach their advertised limits. The finding: "most models break much earlier than advertised. A model claiming 200k tokens typically becomes unreliable around 130k, with sudden performance drops rather than gradual degradation."

Models do not gracefully fade. They work, then they do not. The cliff is steep.

On the OpenAI developer forums, users have documented this experience repeatedly. One user named rajeev.a.j.madari described the frustration: "ChatGPT struggles to remember the entirety of our chat. Most times, it appears as though the system only acknowledges my most recent input, causing confusion."

Another user, Joel_Barger, noted practical consequences: "In a coding situation context is important. It'll lose or change the name of namespaces or class methods arbitrarily."

These are not edge cases. This is the normal experience of long conversations with AI models.

## Compute Scales, Money Burns

Making context windows bigger is expensive. A user named gdiamos explained the economics on Hacker News: "the compute still scales at best linearly with the input size. So a context size of 100k requires 100x more compute than a prompt size of 1k."

But it is actually worse than linear. The attention mechanism scales quadratically with sequence length. Double the context, quadruple the compute.

This is why longer context windows cost more per token. This is why free tiers have shorter limits. This is why your enterprise plan still cuts you off eventually.

Various techniques mitigate this. Sparse attention patterns skip connections between distant tokens. Sliding window approaches process chunks separately. Architectural innovations compress older context into summaries. But each solution trades something: speed, accuracy, or the ability to connect ideas across long distances.
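To see why "double the context, quadruple the compute" is not a figure of speech, here is a back-of-the-envelope sketch. It counts only pairwise attention comparisons and ignores everything else a model computes, so read the numbers as a shape, not a pricing table.

```python
# Back-of-the-envelope: how the attention step alone grows with context.
# Real models also do per-token work (feed-forward layers) that grows
# linearly, so attention is not the whole bill -- but it is the part
# that explodes as the window gets long.

def attention_comparisons(context_tokens: int) -> int:
    # Self-attention relates every token to every other token,
    # so the work grows with the square of the sequence length.
    return context_tokens ** 2

baseline = attention_comparisons(1_000)
for n in [1_000, 2_000, 4_000, 100_000, 200_000]:
    print(f"{n:>9,} tokens -> {attention_comparisons(n) / baseline:>12,.0f}x "
          f"the attention work of a 1k prompt")
```

Doubling from 1,000 to 2,000 tokens quadruples the attention work; stretching to 200,000 multiplies it by 40,000. That curve is what the pricing tiers and cutoffs are tracking.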
## Why "Memory" Features Do Not Solve This

Modern AI assistants advertise memory features. ChatGPT will remember that you prefer concise responses. Claude can store facts about your projects across conversations.

This is not the same as context. These memory systems store specific facts in a separate database. When you start a new conversation, the AI retrieves relevant memories and inserts them into the context window. It is retrieval, not actual remembering.

The difference matters because retrieval is selective. The system guesses which stored facts matter for this conversation. It guesses wrong sometimes. And even when it guesses right, those retrieved memories still compete for space in the same limited context window as everything else.

As segmondy noted on Hacker News: "infinite context window is not AGI enough, memory is not substitute for planning and reasoning." Storing facts is not the same as understanding them. Remembering that you mentioned a deadline last Tuesday is not the same as tracking how that deadline interacts with the three other constraints you mentioned this Tuesday.

## Position Is Strategy

If you understand how context windows work, you can work with them instead of against them.

Put critical information first. The model pays attention to the beginning. Do not warm up with backstory and save the important constraints for paragraph six. Lead with what matters.

Repeat yourself strategically. If something was crucial in message three and you are now on message thirty, say it again. The model will not be offended. It probably does not remember anyway.

Keep conversations focused. A context window shared across fifteen different topics is worse than three separate conversations about five topics each. Specificity beats sprawl.

Summarize periodically. When a conversation gets long, ask the AI to summarize the key points, then start a new conversation with that summary as the first message. You lose nuance but gain clarity. (A minimal sketch of this pattern closes out the post.)

## The Strange Future

Context windows keep growing. The research community keeps finding ways to push limits. We went from 4,000 tokens to 10 million in a few years. That trajectory seems likely to continue.

But bigger is not the same as better, and the fundamental challenges remain architectural. fsndz observed on Hacker News: "Context windows are becoming larger and larger, and I anticipate more research focusing on this trend." The research exists because the problem exists.

There is something almost poetic about building systems that forget. Human memory is imperfect too. We lose the middle of lectures. We remember beginnings and endings. We reconstruct instead of recall.

The AI does not mimic human memory by design. It arrives at similar limitations through completely different mechanisms. Different architectures, same result: things get lost.

You tell the AI about your project. It responds helpfully. You continue the conversation. Somewhere around message fifteen, you notice that helpful response has drifted. The AI is still answering. It is still confident. It has simply forgotten what you were actually talking about.

This is not malice. This is math. And until the math changes, every conversation with an AI carries an invisible countdown.
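One practical coda. The summarize-and-restart tip from the "Position Is Strategy" section is simple enough to sketch. In the code below, `ask_model` is a placeholder for whichever chat API you actually use, and the prompt wording is just an example; the shape of the pattern is the point, not the plumbing.

```python
from typing import Callable, List

def summarize_and_restart(history: List[str],
                          ask_model: Callable[[str], str]) -> List[str]:
    """Collapse a long conversation into a summary that seeds a fresh one.

    ask_model is a stand-in for a real chat API call: prompt in, text out.
    """
    transcript = "\n\n".join(history)
    summary = ask_model(
        "Summarize the key facts, decisions, and open constraints in this "
        "conversation. Be concrete: names, numbers, deadlines.\n\n" + transcript
    )
    # The new conversation starts with the summary up front, where the
    # model attends best -- not buried in the middle of a long window.
    return [f"Context carried over from an earlier conversation:\n{summary}"]
```

You lose nuance in the compression, exactly as noted above, but the new window opens with the facts that matter sitting where the model actually looks.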