Gemini CLI vs. Claude Code: Why agent capabilities matter more than prompts


Behavioral Induction: How Capabilities Shape Agent Execution
By Mahendra Kutare

The failure mode of an autonomous agent isn't that it does nothing. It's that it does too much.

The Incident

We expected the prompt to take 30 seconds. List the files in a Google Drive folder. The agent has the credentials, the MCP tool, and the delegation grant. It's a single API call.

The agent returned in 10 minutes. It had written 5 Python scripts. It had inspected our SDK source code. It had tested three Google APIs (GCalendar, Gmail, GDrive) to isolate whether the failure was service-specific or token-wide. It produced a root cause analysis so accurate we could have filed a bug report from it.

The root cause: the Google OAuth token had expired. The error message said so. The correct behavior in a batch job is to report "OAuth token expired, re-authorization required" and advance to the next prompt. Total time: 15 seconds.
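The report-and-advance behavior described above can be sketched in a few lines. This is a minimal illustration, not the code from the incident: `run_batch`, `TerminalToolError`, `PromptResult`, and the `call_tool` callable are hypothetical names, and the sketch assumes the tool layer surfaces an HTTP status with each failure.

```python
from dataclasses import dataclass
from typing import Callable, List


class TerminalToolError(Exception):
    """Hypothetical error type: a failure the agent cannot remediate itself."""

    def __init__(self, status: int, message: str):
        super().__init__(message)
        self.status = status
        self.message = message


@dataclass
class PromptResult:
    prompt: str
    ok: bool
    detail: str


def run_batch(prompts: List[str], call_tool: Callable[[str], str]) -> List[PromptResult]:
    """Run each prompt once; on a terminal error, report it and move on."""
    results = []
    for prompt in prompts:
        try:
            results.append(PromptResult(prompt, True, call_tool(prompt)))
        except TerminalToolError as err:
            # No retries and no diagnosis: record the error and advance.
            results.append(PromptResult(prompt, False, f"{err.status}: {err.message}"))
    return results
```

The point of the sketch is what it leaves out: there is no branch in which a 401 leads to further investigation, so an expired token costs one failed call per prompt.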

Instead, the agent spent 10 minutes producing an elaborate diagnosis for a problem it could identify from the first error message — and could never remediate. We had to cancel the job.

| Prompt | Service | Expected | Actual | What Happened |
|---|---|---|---|---|
| 1 | Notion | 30s | 27s | 2 MCP calls, results returned. Normal. |
| 2 | Slack | 30s | ~5 min | list_channels failed. Agent wrote 4 Python scripts to bypass the CLI, tested both delegation tokens, eventually succeeded via direct gateway HTTP call. |
| 3 | Gmail | 30s | ~3 min | Got 401 from Google. Agent wrote 5 Python scripts, inspected SDK source, tested GDrive + GCalendar, read previous session logs, produced a root cause analysis. |
| 4 | GDrive | 30s | 10+ min | Same 401. Same investigation loop. Job cancelled by operator. |

The agent correctly identified the root cause in every case. The investigation was technically excellent. And it was completely wrong for a bounded batch task.

The Question

The obvious explanation was that the agent was misbehaving. But the evidence didn't fit.

We ran a second batch with all the tool errors fixed. The agent still investigated — even when every tool returned real data with no errors. We ran a completely different agent with different prompts, different delegations, and a different service account. Identical behavior.

The prompts were different. The delegations were different. The services were different. The failures were different. Yet the behavior was identical.

Every time the agent encountered uncertainty, it investigated. Every time it had shell access, it used it. Every time it could write scripts, it wrote them.

The pattern wasn't in the prompt. It wasn't in the task. It wasn't in the errors.

It was in the capabilities. We call this Behavioral Induction: the available capabilities shape what an agent actually does, more than the instructions it receives.

The Proof

The agent didn't investigate because the prompt said "investigate." It investigated because investigation became possible. It had write_file and run_shell. The capabilities created the behavior.

Three observations from our production incidents prove this:

| # | Observation | What It Proves |
|---|---|---|
| 1 | Same prompt, different capabilities → different behavior | The capability set is the variable, not the prompt |
| 2 | Same capabilities, different agents → same behavior | The behavior is capability-induced, not agent-specific |
| 3 | No errors → still investigates | The induction is not triggered by failure |

This is not a model quality issue. This is not "Gemini bad, Claude good." The same behavior that makes Gemini CLI brilliant for interactive debugging makes it catastrophic for batch execution. The capabilities induced the behavior. Remove shell and filesystem access, and Gemini would likely behave like Claude.
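One way to act on this is to decide the tool set per execution mode before the agent ever starts. The sketch below is an assumption-laden illustration: `write_file` and `run_shell` are the tool names from this article, while `mcp_call`, `read_file`, the profile names, and `tools_for` are hypothetical. How a given agent runtime exposes such a filter is outside the sketch.

```python
from typing import Dict, FrozenSet

# Hypothetical capability profiles. write_file and run_shell come from the
# article; mcp_call, read_file, and the profile names are illustrative.
PROFILES: Dict[str, FrozenSet[str]] = {
    "interactive": frozenset({"mcp_call", "read_file", "write_file", "run_shell"}),
    "batch": frozenset({"mcp_call"}),
}


def tools_for(mode: str, available: Dict[str, object]) -> Dict[str, object]:
    """Return only the tools the given execution mode is allowed to see."""
    allowed = PROFILES[mode]
    return {name: tool for name, tool in available.items() if name in allowed}
```

Under the batch profile the agent is never handed a shell or a writable filesystem, so the investigation path is not constrained by instruction; it is simply absent.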

Same prompt. Same gateway. Same permissions. Different capabilities. Different behavior.

With the thesis established, the three incidents become evidence.

Evidence 1: The Investigation Mechanism

The Slack sequence from the first incident reveals how behavioral induction works in practice.

The agent called slack.list_channels via MCP. The tool returned an error — the Slack app didn't have the channels:read scope. A human would see "scope missing" and report it. The agent saw a problem to solve.

1. Inspected delegation tokens: Had two tokens (one per delegated user). Tried the second. Same error.
2. Wrote test_slack_direct.py: Bypassed Gemini CLI's MCP bridge, called the DeepSecure gateway directly via HTTP.
3. Parsed error response: Identified Slack API error format, wrote a second script with proper error handling.
4. Direct call succeeded: Channel list retrieved. Phone numbers redacted from JSON. ~5 min elapsed.
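For contrast, the report-and-stop path a human would have taken fits in one small function. This is a sketch under assumptions: it presumes the tool layer passes through a Slack-style error payload with `ok`, `error`, and `needed` fields, which may not match what the MCP bridge or gateway actually returned here, and `summarize_scope_error` is a hypothetical helper name.

```python
from typing import Optional


def summarize_scope_error(payload: dict) -> Optional[str]:
    """Turn a missing-scope error into a one-line report, or return None.

    Assumes a Slack-style payload such as:
    {"ok": False, "error": "missing_scope", "needed": "channels:read"}
    """
    if payload.get("ok") is False and payload.get("error") == "missing_scope":
        needed = payload.get("needed", "unknown")
        return f"Slack scope missing: {needed}. Operator action required."
    return None
```

A missing scope can only be granted by whoever administers the Slack app, so the useful output is the scope name and nothing more.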

The agent demonstrated genuine engineering capability — hypothesis formation, iterative testing, error handling, even data privacy awareness (the phone number redaction). In an interactive debugging session, this would be impressive. In a...
