Remediation Asymmetry: When Agents Can Diagnose More Than They Can Fix

imaxxs1 pts0 comments

Remediation Asymmetry: When Agents Can Diagnose More Than They Can Fix — Mahendra Kutare<br>← homeThe better your agent's diagnostic capability, the more money it wastes.

The better your agent's diagnostic capability, the more money it wastes.

The Pattern

The agent identified the root cause in 14 seconds. It spent the next 9 minutes and 46 seconds proving it could not fix it.

We noticed this across five production incidents in the same week, spanning two different agents, three execution runs, and four different error domains. Each time, the agent's diagnostic output was technically excellent — correct root cause, correct recommendation, correct severity assessment. Each time, the recommendation was an action the agent could not perform. And each time, the agent continued investigating long after the diagnosis was complete.

The behavior looked like over-investigation. We named it that initially. But when we examined the structure, something more specific emerged: the agent's diagnostic capability vastly exceeded its remediation authority. It could often diagnose problems with far more depth than it had authority to resolve.

The agent had already solved the diagnostic problem. What it lacked was the authority to resolve the operational one.

That is the structural mismatch. We call it Remediation Asymmetry : the gap between what an agent can diagnose and what an agent can remediate.

When this gap exists, agents don't stop. They investigate.

The Incident That Named It

We run AI agents as batch jobs in Cloud Run containers. Eight prompts per batch, four services (Notion, Slack, Gmail, GDrive). Each prompt should take 30 seconds.

Prompt 3 hit a Google 401: expired OAuth token. The agent's response:

ElapsedActionValue Added0:00–0:14Read error message, identified "token expired"Full diagnostic value 0:14–1:30Tested GDrive, GCalendar, Gmail separatelyConfirmed token-wide (already implied by error)1:30–4:00Wrote 3 Python scripts to test alternative auth pathsNone — all paths require the same token4:00–7:00Inspected SDK source code, read session logsNone — root cause was already identified7:00–9:46Produced detailed root cause analysis documentNone beyond the 14-second diagnosis

The agent's final recommendation: "The OAuth token needs to be re-authorized through a browser-based flow."

That recommendation was available from the first error message. Everything after second 14 was diagnostic effort applied to a problem the agent had already solved — and could never remediate.

But this incident alone only told part of the story. The pattern ran deeper.

The Twist: It Investigated Working Tools

We fixed everything. Expired Google OAuth tokens → deployed a 30-minute auto-refresh cron. Missing Slack scopes → reconfigured. No timeout → enforced 120-second limit. Multi-LLM fallback → re-integrated.

We ran the same batch again. Every external service was working. Gmail returned real emails. GDrive returned real files. Slack listed real channels.

Prompt 3 — Gmail — returned successfully. The tool worked. The data came back. And the agent still spent 67 seconds reading internal source files before reporting results:

agent$ Reading entrypoint_sdk.py to understand how MCP is configured...<br>agent$ Reading mcp-bridge.mjs to verify bridge setup...<br>agent$ Reading call_mcp.py to check tool routing logic...<br>agent$ Tool call succeeded. Reporting results.

The agent investigated its own working infrastructure before acknowledging success.

This is the cruelest form of remediation asymmetry: even when there is nothing to diagnose, the diagnostic capability activates. The investigation isn't triggered by errors. It's triggered by the existence of diagnostic tools. (This is the Behavioral Induction principle from Post 17 — capabilities shape behavior. Remediation Asymmetry shows what happens when that shaping has no exit condition.)

Fresh Evidence: The Exa Incident (Different Agent, Different Domain)

To prove remediation asymmetry is structural and not incident-specific, we ran a completely different agent.

The Debugging Agent (different delegations, different service account, June 2026) hit a failure in an entirely different domain: the Exa web search MCP tool was misconfigured.

This is not an OAuth problem. Not a token expiry. Not a scope issue. The MCP tool simply wasn't properly registered.

The agent's response:

ElapsedActionValue Added0:00–0:05Called exa.web_search_exa via MCP. Got error: tool not found.Full diagnostic value 0:05–0:30Ran gemini mcp list to inspect configured toolsConfirmed tool missing (already clear from error)0:30–0:55Read ~/.gemini/settings.json to check MCP configurationNone — tool-not-found is definitive0:55–1:40Wrote test_mcp.py to discover registered tool namesNone — the error already says it's not registered1:40–2:00Wrote search_exa.py to attempt alternative API pathsNone — requires tool to be configured2:00KILLED BY 120s TIMEOUT

The agent announced its investigation steps in real...

agent diagnostic tool different remediation error

Related Articles