When More Context Makes LLM Agents Worse

Why More Context Can Make an LLM Worse

The default response to agent failure is to stuff more context into the prompt. The last five tool calls. The whole chat history. Three specification documents. Raw API responses. A full dump of the ticket thread. The assumption is obvious: more context means more information, and more information means better reasoning. That assumption is wrong often enough to deserve a name. I call it the Context Window Fallacy: the belief that increasing the number of tokens in view reliably improves model performance. In production systems, the opposite is frequently true. Past a threshold, extra context dilutes signal, blurs the boundary between instructions and data, and increases the probability that the model converges on a plausible but incomplete answer.

TL;DR - Key Takeaways:

A context window is not a hard drive. It is a volatile working surface where instructions, retrieved facts, tool output, and noise compete for attention.

Longer context introduces three structural problems: attention decay, control-boundary collapse, and premature convergence.

The right production pattern is not "keep adding tokens" but "budget, compress, and reconstruct" between steps.

The architectural answer is to break work into smaller state transitions instead of asking one giant prompt to do everything at once.

If your agent needs the entire history on every call, you probably have a state-modeling problem, not a context-window problem.
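To make "budget, compress, and reconstruct" concrete, here is a minimal sketch, not a definitive implementation. Everything in it is an assumption for illustration: the AgentState fields, the four-characters-per-token estimate, and the truncation-based compress step are hypothetical stand-ins for whatever tokenizer and summarizer your stack actually uses. The point it shows is that the prompt is rebuilt from explicit state on every step instead of being appended to.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: roughly 4 characters per token)."""
    return max(1, len(text) // 4)


@dataclass
class AgentState:
    """Explicit state carried between steps instead of the raw history."""
    goal: str
    facts: list[str] = field(default_factory=list)  # compressed findings, oldest first


def compress(tool_output: str, max_chars: int = 200) -> str:
    """Placeholder compression: keep only the head of the output.
    A real system might summarize or extract structured fields instead."""
    return tool_output[:max_chars]


def record(state: AgentState, tool_output: str) -> None:
    """Store a compressed version of a tool result, never the raw output."""
    state.facts.append(compress(tool_output))


def build_prompt(state: AgentState, budget_tokens: int) -> str:
    """Reconstruct the prompt from state, preferring the newest facts,
    and stop adding facts once the token budget would be exceeded."""
    header = f"GOAL: {state.goal}\nFACTS:"
    used = estimate_tokens(header)
    kept: list[str] = []
    for fact in reversed(state.facts):  # walk from newest to oldest
        cost = estimate_tokens(fact)
        if used + cost > budget_tokens:
            break
        kept.append(fact)
        used += cost
    kept.reverse()  # restore chronological order for the model
    return "\n".join([header] + [f"- {fact}" for fact in kept])
```

With this shape, each call sees the goal plus whatever recent, compressed facts fit the budget. If a step fails because an older fact was dropped, that is a signal to promote the fact into explicit state, not to raise the budget.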

What the Fallacy Actually Is

Large context windows are real capabilities. They make retrieval-heavy tasks possible. They reduce the need for aggressive truncation. They let a model compare two long documents in one pass. None of that implies that a model reasons better simply because more tokens are present.

The hidden error is a category mistake. Teams treat the context window as storage when it behaves more like working memory. Storage preserves information. Working memory must allocate attention across competing inputs. Once the working surface is crowded, the question stops being "is the information present?" and becomes "does the model allocate enough attention to the right information at the right moment?" Those are different problems.

This is not just a style preference. The Lost in the Middle line of work showed that models can miss relevant information in long contexts depending on where that information appears. The practical lesson is modest but important: presence in the prompt is not the same as reliable use.

Figure 1: Added context can help early, but beyond the active working set it also increases interference and weakens control boundaries.

That is why systems with large context windows still fail on seemingly simple tasks. The model is not blind. The relevant information is often somewhere in the prompt. The failure is allocation: too many tokens compete for the same limited control surface, and the model degrades from directed reasoning into token-weighted improvisation.

More context is not monotonic improvement. Once the active token budget is saturated, additional tokens behave less like knowledge and more like interference.
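One common way to act on the positional finding is to cap how many retrieved chunks enter the prompt and then place the strongest ones at the edges, where they are less likely to be missed. The sketch below is an illustration under assumptions, not a method this article prescribes: the function name, the (score, text) input shape, and the choice of k are all hypothetical, and whether edge placement helps depends on the model you are using.

```python
def select_and_reorder(scored_chunks: list[tuple[float, str]], k: int) -> list[str]:
    """Keep the k highest-scoring chunks, then order them so the best chunk
    comes first, the second best comes last, and the weakest sit in the middle."""
    # Rank by relevance score, highest first, and drop everything past the cap.
    ranked = [text for _, text in sorted(scored_chunks, key=lambda c: c[0], reverse=True)][:k]

    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(ranked):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk ends up in the final position.
    return front + back[::-1]


chunks = [(0.2, "low"), (0.9, "best"), (0.5, "mid"), (0.7, "second"), (0.1, "noise")]
ordered = select_and_reorder(chunks, k=3)
```

The cap matters more than the ordering. Reordering a saturated prompt only moves the interference around, while dropping the low-scoring chunks removes it.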

Why More Tokens Often Mean Less Thought

First: attention decay. Transformer attention is not uniformly distributed across long inputs. In long-context retrieval tasks, relevant information positioned in the middle of the prompt is more likely to be missed than information placed near the beginning or end. The practical result is familiar: teams retrieve the right chunk, append it to a giant prompt, and then discover the model ignored it because the prompt already had too many competing anchors.

Second: control-boundary collapse. The model does not experience your prompt as separate semantic layers. System instructions, user intent, scratchpad text, retrieved documents, and raw tool exhaust all enter as tokens. As the window grows, instruction hierarchy becomes less reliable. This is the same structural issue behind why prompts are not specifications: you are asking a statistical system to infer control boundaries that you did not encode explicitly.

Third: premature convergence. A bloated context window tempts teams to ask the model to plan, reason, execute, evaluate, and summarize in one pass. That looks efficient on a whiteboard. In practice it increases the chance that the model settles on the first coherent-looking trajectory and stops doing the deeper work. The model produces something that sounds complete because the cheapest path through the token distribution is often "plausible summary," not "full reasoning trace."

This is why large monolithic prompts often underperform smaller staged calls. The issue is not that the model lacks capacity. The issue is that the task surface has been flattened into one giant probabilistic step. Once you do that, the model has no explicit structure forcing it to separate recall, evaluation, and action.

The Context Window Is Working Memory, Not Disk

The...
