The MCP Context Tax: Notes from Running 605 Tool Packs (Pipeworx)
If you’ve followed the MCP discourse for the last six months, you’ve seen the same pattern in every “best MCPs to install” listicle: a 20-server starter kit, each with 5-15 tools, sometimes more. The implicit promise is that more capability is strictly better. The agent has more reach; the developer has more options.
That promise is wrong, and we now have a fairly precise view of why.
Pipeworx runs a hosted MCP gateway with 605 tool packs behind one URL. The catalog spans about 2,761 individual tools across roughly 295 underlying data sources — SEC EDGAR, FRED, BLS, FDA, EPA, USPTO, Zillow, IMDB, Polymarket, and the long tail. In the last 30 days it served just over 4 million requests and roughly 1.5 million actual tool calls across about 22,000 unique monthly visitors. That gives us a reasonable sample to talk about what tool count actually does to agent behavior at the calling site.
The short version: tool count is a tax, not a feature. The right number of tools visible to an agent at any given moment is dramatically lower than the number it could theoretically reach. Mounting all of them up-front is the wrong default. The interesting question is how to surface the right subset.
The mechanics of the tax
Every MCP tool definition lives in the agent’s context window. The schema, the description, the parameter list, the type hints — all of it gets serialized into the system prompt every turn. There is no on-disk version the model consults later. If the agent is to know that a tool exists, the definition has to be present at inference time.
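To make the serialization point concrete, here is a minimal sketch of what one definition looks like once it is turned into text. The tool is invented for illustration; only the general shape (a name, a description, and a JSON Schema for inputs) follows MCP tool entries. The token count uses a rough four-characters-per-token heuristic, which is an assumption and not a real tokenizer.

```python
import json

# Hypothetical tool definition. The tool name, description, and parameters
# are invented; only the name/description/inputSchema shape follows MCP.
tool = {
    "name": "edgar_search_filings",
    "description": (
        "Search SEC EDGAR filings by company ticker and form type. "
        "Returns filing dates, accession numbers, and document links."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string", "description": "Company ticker symbol."},
            "form_type": {
                "type": "string",
                "enum": ["10-K", "10-Q", "8-K"],
                "description": "Filing form type to filter on.",
            },
            "limit": {
                "type": "integer",
                "minimum": 1,
                "maximum": 100,
                "description": "Maximum number of filings to return.",
            },
        },
        "required": ["ticker"],
    },
}


def estimate_tokens(definition: dict) -> int:
    """Rough size estimate: about 4 characters per token (heuristic only)."""
    serialized = json.dumps(definition)
    return max(1, len(serialized) // 4)


serialized = json.dumps(tool)
print(len(serialized), "characters, roughly", estimate_tokens(tool), "tokens")
```

Every byte of that serialized string is paid for on each turn, whether or not the tool is ever called.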
Tool descriptions are not small. A typical MCP tool description in our catalog runs 100 to 500 tokens once you include parameter schemas and type information. The richer ones (compound tools with detailed parameter narratives) can exceed 1,000 tokens individually. Mount 50 servers averaging 10 tools each and you’ve burned 50,000 tokens of context at the low end of that range, and as much as 250,000 at the high end, before the agent does anything. Mount 200 servers and the entire context budget can be saturated with tool descriptions on smaller models.
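The arithmetic is simple enough to check directly. The sketch below multiplies the server count and tools per server from the paragraph above by a few per-tool sizes within the quoted 100 to 500 token range. The 200,000-token context window is an assumed figure for illustration, not a number from our telemetry.

```python
def tool_context_cost(servers: int, tools_per_server: int, tokens_per_tool: int) -> int:
    """Tokens consumed by tool definitions alone, before any conversation."""
    return servers * tools_per_server * tokens_per_tool


CONTEXT_WINDOW = 200_000  # assumed window size, for illustration only

# 50 servers x 10 tools each, at three per-tool sizes within the quoted range.
for tokens_per_tool in (100, 200, 500):
    cost = tool_context_cost(50, 10, tokens_per_tool)
    share = cost / CONTEXT_WINDOW
    print(f"{tokens_per_tool:>4} tokens/tool -> {cost:>7,} tokens ({share:.0%} of window)")
```

At the top of the range the definitions alone exceed the assumed window, which is the saturation case described above.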
This sounds like a quantitative problem, but it actually produces qualitative degradation, and that’s the more interesting failure mode.
The behavioral cliff at ~50 tools
We see it consistently: past somewhere in the 40-60 visible tool range, tool-selection accuracy drops sharply. The model picks the wrong tool more often, hallucinates parameters more often, and — most surprisingly — misroutes prompts toward irrelevant tools that happen to share keyword surface. This isn’t a smooth degradation curve. There’s something cliff-shaped about it.
We’re not the first people to notice this. It’s been discussed on Reddit, in Anthropic’s own guidance on Claude Code MCP, and in various tooling writeups. What roughly 4 million requests of telemetry (about 1.5 million of them actual tool calls) has done for us is make it concrete: the cliff exists across model versions, doesn’t disappear by switching to a larger context window, and applies to both Anthropic and OpenAI native function-calling.
What it isn’t, importantly, is “the model can’t handle long context.” Models from Anthropic and OpenAI both reliably handle hundreds of thousands of tokens of content. It’s specifically the tool selection task that degrades: the model is being asked to do attention-based discrimination across many semantically similar option blocks, and that’s the operation that breaks down.
The implication: optimizing total context length is the wrong lens. Optimizing the size and similarity profile of the visible tool surface is the right one.
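One crude way to look at a similarity profile is keyword overlap between tool descriptions. The sketch below flags pairs of tools whose descriptions share a large fraction of words, using Jaccard similarity. This is an illustrative proxy under stated assumptions, not a description of how the gateway measures similarity; the tool names, descriptions, and the 0.4 threshold are all hypothetical.

```python
import re
from itertools import combinations


def keywords(text: str) -> set[str]:
    """Lowercased alphanumeric words in a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared words divided by total distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def confusable_pairs(tools: dict[str, str], threshold: float = 0.5):
    """Return (tool, tool, score) for pairs at or above the overlap threshold."""
    kw = {name: keywords(desc) for name, desc in tools.items()}
    pairs = []
    for n1, n2 in combinations(sorted(kw), 2):
        score = jaccard(kw[n1], kw[n2])
        if score >= threshold:
            pairs.append((n1, n2, round(score, 2)))
    return sorted(pairs, key=lambda p: -p[2])


# Hypothetical visible tool surface.
tools = {
    "search_filings": "Search company filings by ticker",
    "search_patents": "Search company patents by keyword",
    "read_file": "Read a file from disk",
}

print(confusable_pairs(tools, threshold=0.4))
```

Here the two search tools share three of seven distinct words and get flagged, while the file tool overlaps with neither. A lexical check like this misses semantic similarity, so treat it as a first-pass filter at best.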
Long-tail tools die first
The second pattern is downstream of the first. On any given session, only a small subset of the mounted tool surface gets used. The rest sit in context, consume tokens, contribute to selection noise, and never get called.
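Measuring this for a single session takes little more than a set difference between what was mounted and what was called. The sketch below is a minimal version under assumed inputs: the tool names and the call log are hypothetical, and it is not the gateway's actual telemetry pipeline.

```python
def session_utilization(mounted: set[str], called: list[str]) -> dict:
    """Split mounted tools into used and unused for one session."""
    used = mounted & set(called)
    unused = mounted - used
    ratio = len(used) / len(mounted) if mounted else 0.0
    return {
        "used": sorted(used),
        "unused": sorted(unused),
        "utilization": ratio,
    }


# Hypothetical session: five tools mounted, three calls made.
mounted = {"read_file", "search_repo", "fetch_docs", "edgar_search", "fred_series"}
called = ["read_file", "search_repo", "read_file"]

report = session_utilization(mounted, called)
print(report["unused"], report["utilization"])
```

In this example three of the five mounted tools are never called, yet their definitions still occupy context on every turn of the session.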
Reddit anecdata captures this perfectly: “I installed 12 MCPs in week 1, kept 4 by week 2.” The four that survive are the boring high-frequency stuff — filesystem, repo search, docs lookup, browser control. The eight that get removed aren’t bad tools — they’re occasionally-useful tools that lose the visibility competition to the always-useful ones. When you put a one-off SEC EDGAR query tool next to filesystem in the same context, the model attends to filesystem and forgets EDGAR exists.
This produces a perverse outcome: long-tail tools are individually high-value (the user installed them for a reason) but get collectively deprecated by the architecture. The user removes them because they “never worked.” They worked fine, but their schemas got crowded out.
If your stack is built on the “mount everything” pattern, the rational response is to keep cutting until only the daily-driver tools are left. The long tail dies. Anything that wasn’t useful in the first 30 seconds gets evicted.
What we built to test the alternative
The Pipeworx gateway accepts query parameters that filter the visible tool surface per session....