When your agent extensions fight each other - Microsoft for Developers
Waldek Mastykarz
Principal Developer Advocate
This is the fourth article in a series about Agent Experience (AX): the practice of making AI coding agents work correctly with your technology. The series covers what you can and can’t control in the agent stack, how to measure whether your extensions are helping or hurting, and how to iterate toward better outcomes.
You shipped your extension, measured it, confirmed it creates lift. Then a developer installs it alongside 14 other extensions, and outcomes get worse. Not because your extension is broken, but because extensions don’t exist in isolation: they compete.
In the previous article, we covered how to measure whether a single extension creates lift or drag. That measurement assumed a clean environment: your extension, and nothing else. Real developer workspaces don’t look like that. A real workspace has MCP servers for the developer’s cloud provider, instruction files from their team, skills from their framework, and whatever’s trending on social media this week. The question isn’t whether your extension works; it’s whether it works well with others.
More extensions, worse outcomes
It seems like extensions should be additive: extension A improves authentication code, extension B improves database queries. Install both, get better auth and better queries. Right? Not quite.
Every extension consumes tokens in the context window just by being installed. Tool descriptions and skill definitions take up space before you even type a prompt. A single MCP server with 8 tools might add 2,000 tokens of tool descriptions. Five MCP servers? That’s 10,000 tokens of tool descriptions alone, and you haven’t asked anything yet. The context window is a fixed budget. Every token spent on tool descriptions is a token not spent on workspace context or conversation history. Install enough extensions and the harness starts making cuts: summarizing tool descriptions, dropping tools it deems irrelevant, and truncating whatever doesn’t fit. Your carefully written tool description might get compressed into something the model can barely interpret.
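To make the budget arithmetic concrete, here is a minimal sketch using the illustrative numbers from the paragraph above (roughly 2,000 tokens of tool descriptions per MCP server). The `Extension` class, the server names, and the 128,000-token context window are assumptions made up for this example; real token counts depend on the model's tokenizer and on how the harness serializes tool descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extension:
    """Hypothetical record of what one extension costs just by being installed."""
    name: str
    description_tokens: int


def remaining_budget(context_window: int, extensions: list[Extension]) -> int:
    """Tokens left for workspace context and conversation history."""
    overhead = sum(ext.description_tokens for ext in extensions)
    return context_window - overhead


# Five MCP servers at about 2,000 tokens of tool descriptions each.
servers = [Extension(f"mcp-server-{i}", 2_000) for i in range(1, 6)]
overhead = sum(ext.description_tokens for ext in servers)

# Assumed context window size, for illustration only.
CONTEXT_WINDOW = 128_000

print(f"Tool descriptions: {overhead} tokens")
print(f"Left for everything else: {remaining_budget(CONTEXT_WINDOW, servers)} tokens")
```

The subtraction is trivial, but running it against your own extension's descriptions is a quick way to see how much of the budget you claim before the developer types anything.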
But it’s not just about space. Every token in the context changes the path the model takes. The model doesn’t process your extension’s content in isolation and then process the next one separately. It attends to everything at once. Add a database extension next to your auth extension, and both extensions produce different output than either would alone, even if they have nothing to do with each other. The mere presence of unrelated tool descriptions shifts the model’s attention and changes the output.
We’ve measured this directly: the same extension that produces consistent lift in isolation can produce drag with even one other extension present. Not because the extension changed, but because the environment around it did.
Three ways extensions conflict
Not all composition problems are about token budgets, though. Extensions can actively interfere with each other in ways that are harder to spot.
Vocabulary collisions
Two extensions describe their tools using similar language. Your tool says "manage authentication settings", another tool says "configure identity and access." The developer asks "set up auth for my app." Both tools match the intent, the model picks one, and it might not be yours.
How would you even debug that? The developer sees wrong output and blames the model. But it’s not a bug in either extension; it’s an emergent property of putting two tools with overlapping vocabulary into the same context. The model doesn’t know which one the developer means, and it can’t ask (most harnesses don’t support clarifying questions for tool selection). It makes a choice based on factors you can’t see, such as context position and training data associations. What’s more, that choice might change between runs.
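One rough way to spot this kind of collision before shipping is to compare tool descriptions at the level of concepts rather than words. The sketch below is only an illustration: the `CONCEPTS` map is a hand-written, hypothetical stand-in for whatever similarity measure you would really use (embeddings, for example), and it says nothing about how a model actually selects tools.

```python
import re

# Hypothetical concept map: collapses near-synonyms into one label.
CONCEPTS = {
    "auth": "auth",
    "authentication": "auth",
    "identity": "auth",
    "access": "auth",
    "manage": "configure",
    "configure": "configure",
    "set": "configure",
    "settings": "configure",
}


def concepts(text: str) -> set[str]:
    """Map the words in a description or prompt to coarse concept labels."""
    words = re.findall(r"[a-z]+", text.lower())
    return {CONCEPTS[word] for word in words if word in CONCEPTS}


def overlap(a: str, b: str) -> float:
    """Jaccard similarity of two texts at the concept level."""
    ca, cb = concepts(a), concepts(b)
    if not ca or not cb:
        return 0.0
    return len(ca & cb) / len(ca | cb)


yours = "manage authentication settings"
theirs = "configure identity and access"
prompt = "set up auth for my app"

# The two descriptions share no literal words...
shared_words = set(yours.split()) & set(theirs.split())

print(shared_words)
print(overlap(yours, theirs))
print(overlap(prompt, yours), overlap(prompt, theirs))
```

The two descriptions have no words in common, yet under this toy map they cover the same concepts and match the developer's prompt equally well. That is the situation in which the model's pick comes down to factors you can't see.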
Guidance conflicts
Extensions inject guidance through tool descriptions and skill definitions. The agent also picks up instruction files (.github/copilot-instructions.md, AGENTS.md, .instructions.md) in the repo. Both end up in the same context window, and neither knows the other exists.
The collision happens when an extension’s guidance contradicts the repo’s instructions. Your extension’s tool description says "always use the v3 SDK." The team’s copilot-instructions.md says "we use the v2 SDK for legacy compatibility." Which one wins? The model sees both, and the outcome depends on context ordering and phrasing strength, filtered through whatever the model’s training data says about the technology. The result is unpredictable, and it can flip between runs.
And it happens between extensions too: two extensions both provide guidance on authentication patterns, each assuming it’s the only authority. The model gets contradictory instructions with no way to tell which one matters.
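Some of these contradictions can be caught mechanically before they reach the model. Here is a minimal sketch that scans guidance text for the SDK version it pins and flags disagreement. The `sdk_versions` helper and the "vN SDK" pattern are assumptions for this example, tailored to the two sentences quoted above; real guidance is phrased far more loosely and would need a more tolerant check.

```python
import re

# Assumed phrasing: guidance pins a version as "v<number> SDK".
SDK_VERSION = re.compile(r"\bv(\d+) SDK\b")


def sdk_versions(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each pinned SDK version to the guidance sources that mention it."""
    found: dict[str, set[str]] = {}
    for name, text in sources.items():
        for version in SDK_VERSION.findall(text):
            found.setdefault(version, set()).add(name)
    return found


# Source names are labels for this example, not real file lookups.
sources = {
    "extension tool description": "always use the v3 SDK.",
    "copilot-instructions.md": "we use the v2 SDK for legacy compatibility.",
}

versions = sdk_versions(sources)
conflict = len(versions) > 1

for version, names in sorted(versions.items()):
    print(f"v{version}: {', '.join(sorted(names))}")
print("Conflicting guidance" if conflict else "No conflict found")
```

A check like this only finds the conflicts you thought to look for, but it turns at least one class of coin flip into something you can see and fix.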
Resource...