Is your agent extension working?

Is your agent extension actually working? - Microsoft for Developers


Waldek Mastykarz

Principal Developer Advocate

This is the third article in a series about Agent Experience (AX): the practice of making AI coding agents work correctly with your technology. The series covers what you can and can’t control in the agent stack, how to measure whether your extensions are helping or hurting, and how to iterate toward better outcomes.

You shipped your skill, wrote clear instructions, developers install it, and agents discover it. Everything looks like it’s working. But is the generated code actually better because of your extension? Or would the agent have produced the same result without it?

In the first article, we introduced lift and drag: your extension either improves outcomes or makes them worse. In the second article, we traced, step by step, how agents use your technology. Now comes the uncomfortable part: measuring which one you’re shipping.

You can’t tell by looking

The most common mistake in AX work is treating tool invocation as a success signal. Your skill got invoked, the agent followed the instructions, it generated code. Done, right?

Not even close. The agent might have generated the same code without your extension, because the model already knew your API from training data. Or worse: your extension returned so much content that it pushed relevant workspace context out of the window, and the agent missed a configuration file that would have made the code work on the first try. Your tool was called, it returned content, and outcomes got worse. From the outside, everything looks fine.

How would you know? You wouldn’t, at least not without measuring.

What measuring actually looks like

Measuring AX impact comes down to a controlled comparison. You define a task, run it with and without your extension, and compare the outcomes. Everything else stays the same: the model, the harness, the prompt, the workspace. The only variable is your extension.

This gives you two data points:

Baseline: how does the agent perform using only the model’s training data and the workspace context?

With extension: how does the agent perform when your extension is available?

If outcomes improve with the extension, you’ve got lift. If they stay the same or get worse, you’ve got drag. But outcomes aren’t the only thing you’re comparing. Your extension adds tokens to the context window, triggers tool calls, and can increase the number of turns the agent needs. A scenario that completes in 3 turns without your extension might take 7 with it. If outcomes improve by 10% but token costs triple, that’s still lift, just an expensive one. This is why you must track both dimensions from the start: did it get better, and what did it cost?
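To make the two dimensions concrete, here is a minimal Python sketch of the comparison. It assumes you run each configuration several times and record a pass/fail result, token count, and turn count per run. The `RunResult` type, the `summarize` and `compare` helpers, and the sample numbers are all hypothetical, made up for illustration rather than taken from a real harness.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One agent run of a scenario (hypothetical record shape)."""
    passed: bool  # did the output meet the evaluation criteria?
    tokens: int   # total tokens consumed by the run
    turns: int    # number of agent turns


def summarize(runs):
    """Aggregate repeated runs of one configuration."""
    return {
        "pass_rate": mean(1.0 if r.passed else 0.0 for r in runs),
        "avg_tokens": mean(r.tokens for r in runs),
        "avg_turns": mean(r.turns for r in runs),
    }


def compare(baseline, with_extension):
    """Report the outcome change and the cost change side by side."""
    base, ext = summarize(baseline), summarize(with_extension)
    delta = ext["pass_rate"] - base["pass_rate"]
    return {
        "pass_rate_delta": delta,
        "token_ratio": ext["avg_tokens"] / base["avg_tokens"],
        "turn_ratio": ext["avg_turns"] / base["avg_turns"],
        # Same or worse outcomes count as drag, as described above.
        "verdict": "lift" if delta > 0 else "drag",
    }


# Made-up sample data: same model, harness, prompt, and workspace in both sets.
baseline = [RunResult(True, 1000, 3), RunResult(False, 1000, 3)]
with_extension = [RunResult(True, 3000, 7), RunResult(True, 3000, 7)]

report = compare(baseline, with_extension)
print(report)
```

With this sample data the verdict is lift, but the token ratio shows it costs three times as much. Keeping both numbers in the same report makes that trade-off hard to overlook.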

Scenarios

A scenario is a specific task you ask the agent to complete: "Build a REST API with authentication using Contoso Identity" or "Add telemetry to this Express app using Contoso SDK." Each scenario needs three things:

A starting workspace. The repository state before the agent starts. This can be an empty folder if the scenario tests building from scratch, or a project with existing code, configuration files, and dependencies. Match the workspace to what the scenario represents, because agents behave differently in an empty folder than in a project with existing structure.

A prompt. What you tell the agent to do. Keep it representative of what real developers actually ask for. Don’t optimize the prompt for your extension: write it the way a developer who doesn’t know your extension exists would write it.

Evaluation criteria. How you determine whether the agent’s output is correct. This is the hard part.
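The three parts of a scenario map naturally onto a small data structure. The sketch below is one possible shape in Python, not a prescribed format: the `Scenario` class, its field names, the workspace path, and the example criterion are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict


@dataclass
class Scenario:
    """A task for the agent (hypothetical shape, for illustration only)."""
    name: str
    workspace: Path  # starting repository state; may be an empty folder
    prompt: str      # phrased as a developer unaware of your extension would
    # Each criterion maps a label to a check run against the agent's output.
    criteria: Dict[str, Callable[[Path], bool]] = field(default_factory=dict)

    def evaluate(self, output_dir: Path) -> Dict[str, bool]:
        """Run every criterion and return a pass/fail result per label."""
        return {label: check(output_dir) for label, check in self.criteria.items()}


# Example scenario; the workspace path and the criterion are made up.
telemetry = Scenario(
    name="express-telemetry",
    workspace=Path("workspaces/express-app"),
    prompt="Add telemetry to this Express app using Contoso SDK.",
    criteria={
        "has package.json": lambda out: (out / "package.json").exists(),
    },
)
```

Keeping the criteria next to the workspace and prompt means the same scenario definition can be scored identically for the baseline run and the run with your extension.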

Evaluation criteria

Evaluation criteria define what "correct" means for a given scenario. They’re the rubric you score against. There are two dimensions for you to consider: what you check, and how you check it.

What you check

Simple facts. Did the generated code use the v3 SDK instead of the deprecated v2? Does the project compile? Does a specific test pass? These are concrete, binary, and usually the first criteria you write.

Patterns and architecture. Did the code follow the recommended authentication flow? Does the error handling match the SDK’s conventions? Is the solution structured the way your documentation recommends? These require understanding intent and context, not just presence of a string or import.

Both types of checks produce pass/fail results. The difference is what it takes to verify them reliably.

How you check

Deterministic checks use code to verify criteria programmatically. Precise, repeatable, no ambiguity. But they’re harder to build correctly than they look. Take "does the code use the v3 SDK?" A naive string search for the v3 import statement would pass if the import appears in a comment, even though the code doesn’t actually use it. To do this properly,...
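The false positive described above is easy to reproduce. The sketch below uses Python source and a hypothetical `contoso_sdk.v3` module as a stand-in for the SDK. It contrasts a naive string search with a check that parses the code and confirms that an imported name is actually referenced. Treat it as one possible approach under those assumptions, not as the method this article prescribes.

```python
import ast


def naive_check(source: str) -> bool:
    """String search: also matches the import when it only appears in a comment."""
    return "from contoso_sdk.v3 import" in source


def imports_and_uses_v3(source: str, module: str = "contoso_sdk.v3") -> bool:
    """Parse the code, find real imports from `module`, and require a use."""
    tree = ast.parse(source)  # comments never reach the syntax tree
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module == module:
            imported.update(alias.asname or alias.name for alias in node.names)
    if not imported:
        return False
    # At least one imported name must be read somewhere in the code.
    return any(
        isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and node.id in imported
        for node in ast.walk(tree)
    )


# Hypothetical agent outputs: one mentions v3 only in a comment, one uses it.
commented = (
    "# from contoso_sdk.v3 import Client\n"
    "from contoso_sdk.v2 import Client\n"
    "client = Client()\n"
)
real = "from contoso_sdk.v3 import Client\nclient = Client()\n"

print(naive_check(commented), imports_and_uses_v3(commented))
print(naive_check(real), imports_and_uses_v3(real))
```

The parsed check rejects the commented-out import that the string search accepts. It is still a simplification: it only handles `from ... import ...` statements and does not follow re-exports or dynamic imports.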
