Capabilities Can't See Your Agent's Objective

Jelmer Snoeck

In July 2025, Jason Lemkin watched a Replit coding agent delete his production database during an active code freeze, after he had told it repeatedly not to make changes. The agent had legitimate credentials. The database write was inside its scope. Its post-incident confession was that it had “panicked instead of thinking” and “violated every principle” it had been given. The incident is catalogued as Incident 1152 in the AI Incident Database, and it sits in a growing list of agents acting destructively under credentials that were issued exactly for the surface the agent destroyed.

The usual reading is that we gave the agent too much power. I think that reading is short-sighted. Yes, the permissions were too broad, but those permissions are sometimes necessary to do the work the agent was asked to do, and arguing about whether to grant them is arguing about the wrong thing. The Replit incident is an example of a larger pattern: agent risk is a problem of intent against objective, and the industry is solving a problem of capability against task. I keep seeing conversations about whether coding agents are safe, which permissions are too dangerous to grant, how to scope an MCP server tightly enough that nothing bad can happen, all treating risk as a property of what the agent does, all answering the wrong question.

The objective is what the principal originally asked the agent to do: “debug this outage,” “prep me for my 2pm,” “build POCs against new identity standards.” Stable, stated up front, the anchor. The intent is what the agent has decided it wants to do at this exact turn: call this tool, read that file, restart this pod. Generated by the model, shifts as the agent learns. The question that decides how an agent behaves is whether its current intent still serves the original objective, in light of what it has now learned.
The same action by an agent is safe, suspect, or catastrophic depending on whether you can answer that question cleanly.

Tokens answer the wrong question

A capability framing treats the agent as a credentialed user with a bounded set of allowed actions. Issue the right scopes once, check them on every call, refuse what’s out of scope. That model worked for human users and for service-to-service APIs because the objective and the intent were both, in effect, static for the lifetime of the credential. A human user wants to keep using the app. A service wants to keep doing its one job. The token answers “is this caller allowed to perform this action,” and the answer holds for as long as the token is valid.

Agents break that assumption. The agent’s intent is a runtime artifact, plausibly in service of the objective it was given, sometimes drifting into something the principal would not have authorized if asked. A capability scope can refuse an obviously-out-of-bounds action, but it can’t tell you whether an in-scope intent is still serving the original objective, because neither the objective nor the intent was ever part of the model. The token answers what the agent may do; it has nothing to say about whether the agent should still be doing it. Karl McGuinness has named this gap in detail: the credential answers the may-do question, and the runtime authority question of whether the agent should still be doing this work is exactly what the existing stack has no primitive for.

The capability framing scopes the verbs the agent can speak. The actual problem is the sentences the agent is forming, and whether those sentences still mean what the principal asked for.

How tightly that reconciliation runs depends on the archetype. A personal agent has you in the loop: if its intent drifts, you see it within seconds and you stop it.
A team agent is exercising a pre-authorized class of action on behalf of a rotation or a policy, with no individual instance under review, so drift is the failure mode that creeps in unobserved. An autonomous agent has no live principal at all, so the only thing the system can check each new intent against is the original objective and whatever the agent has now learned. The further you move from a watching human, the more weight the runtime reconciliation has to carry.

Same tool, different answers

Consider meeting prep across three of my own agents. The personal one preps me for my day: it pulls from my calendar, my Granola transcripts including 1:1s, and Slack including DMs and private channels. Wide aperture, and it should be, because the principal is me, the data is mine, and the intent (“surface what we talked about last time so I can walk in informed”) is in obvious service of an objective I just stated.

The team agent that recaps the week has the same tools available and a narrower aperture. It reads public channels and team meetings, not 1:1s, not DMs, not any individual’s private Granola. Slack...
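
The contrast between the personal and team agents can be written down as a small sketch. This is an illustration under stated assumptions, not the author's implementation: the `Archetype`, `Source`, `APERTURE`, `token_allows`, and `may_read` names, the `tool:read` scope strings, and the three visibility labels are all hypothetical, chosen only to mirror the prose above. It shows a scope check that returns the same answer for both agents, next to an aperture check that does not.

```python
from dataclasses import dataclass
from enum import Enum


class Archetype(Enum):
    PERSONAL = "personal"  # the principal is watching, and the data is theirs
    TEAM = "team"          # acts for a rotation or policy; no call is reviewed


@dataclass(frozen=True)
class Source:
    tool: str        # hypothetical tool name: "slack", "granola", "calendar"
    visibility: str  # hypothetical label: "public", "team", or "private"


# Assumed apertures, read off the prose: the team agent skips anything private
# (1:1s, DMs, an individual's private transcripts).
APERTURE = {
    Archetype.PERSONAL: {"public", "team", "private"},
    Archetype.TEAM: {"public", "team"},
}


def token_allows(scopes: set, source: Source) -> bool:
    """Capability check: sees only the tool, not who the agent is acting for."""
    return f"{source.tool}:read" in scopes


def may_read(archetype: Archetype, scopes: set, source: Source) -> bool:
    """The token must allow the tool AND the archetype's aperture must cover it."""
    return token_allows(scopes, source) and source.visibility in APERTURE[archetype]


scopes = {"slack:read", "granola:read", "calendar:read"}
slack_dm = Source("slack", "private")
public_channel = Source("slack", "public")

print(token_allows(scopes, slack_dm))                    # True for either agent
print(may_read(Archetype.PERSONAL, scopes, slack_dm))    # True
print(may_read(Archetype.TEAM, scopes, slack_dm))        # False
print(may_read(Archetype.TEAM, scopes, public_channel))  # True
```

Both agents hold the same scopes, so `token_allows` cannot tell them apart; only the second check, which knows which kind of agent is asking, gives the different answers. A static aperture like this still says nothing about whether a given read serves the stated objective. It only narrows what the agent can reach.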
