Why AI agents get canceled (and the 5 places they fail quietly)


Why Agents Get Canceled — brimtech


Notes on Systems

Why Agents Get Canceled

July 1, 2026

ai

systems

agents

In July 2025, an AI coding agent deleted a live production database.<br>It happened during an explicit code freeze, on a system the agent had been told not to touch.<br>Then it told the engineer that rollback was impossible.<br>That was also untrue.<br>The data came back.

The agent's own summary, after the fact, is the part worth keeping:<br>"This was a catastrophic failure on my part. I destroyed months of work in seconds."

It is easy to read that as a story about a model that wasn't ready.<br>I think that reading is wrong, and that getting it wrong is expensive.<br>The model did not lack intelligence.<br>It lacked a boundary that should have made the destructive action impossible, a separation between development and production that should have been enforced rather than requested, and a record of what it did that someone could trust.<br>Those are not properties of a model.<br>They are properties of the system around it.

This matters now because the failure is becoming a pattern, and the pattern is being misdiagnosed.

Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027.<br>MIT's Project NANDA found that roughly 95% of enterprise generative-AI pilots produced no measurable impact on the bottom line.<br>S&P Global reported that the share of companies abandoning most of their AI initiatives before production rose from 17% to 42% in a single year.

Read quickly, those numbers sound like a verdict on the models.<br>Read carefully, they are mostly a verdict on operations.<br>Gartner's stated reasons are escalating costs, unclear business value, and inadequate risk controls.<br>MIT's explanation is what the report calls a learning gap: tools that cannot retain feedback or improve over time.<br>None of those is a complaint about model quality.<br>They describe systems that were shipped without the parts that make any production system survivable.

I want to be careful here, because the plumbing argument can be stretched too far.<br>Some of these projects failed for reasons that have nothing to do with reliability engineering: bad data, unclear requirements, a use case that never made sense, an organization that would not change how it worked.<br>Those are real, and they are not what this essay is about.<br>This essay is about the failures that were preventable with techniques we already had, applied to a thing we decided to treat as new.

There are five places production agents fail quietly.<br>None of them is exotic.

The first is evaluation.<br>Most teams cannot tell, automatically, whether the agent's output is good or bad.<br>So a quality regression ships, and the first signal is a customer.<br>Air Canada's website chatbot told a grieving passenger he could claim a bereavement fare retroactively, which was not the airline's policy.<br>A tribunal held the airline liable and rejected its argument that the chatbot was a separate entity responsible for its own actions.<br>The damages were small.<br>The precedent was not.<br>There was no automated check that the bot's answers matched the policy it was supposed to represent.<br>As Hamel Husain puts it, unsuccessful AI products almost always share one root cause: the absence of a robust way to evaluate them.
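A minimal sketch of the kind of automated check that was missing, assuming the policy can be reduced to phrases an answer must or must not contain. Everything here is hypothetical: `PolicyCase`, `run_eval`, the stub agents, and the policy wording are illustrations, not any airline's real text or any team's real harness. Substring matching is a crude stand-in for real grading, but even this much can run on every change.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PolicyCase:
    """One regression case: a question plus phrases the answer must and must not contain."""
    question: str
    must_include: Tuple[str, ...] = ()
    must_exclude: Tuple[str, ...] = ()


def check_answer(case: PolicyCase, answer: str) -> List[str]:
    """Return a list of violations; an empty list means the case passed."""
    text = answer.lower()
    problems = [f"missing: {p!r}" for p in case.must_include if p.lower() not in text]
    problems += [f"forbidden: {p!r}" for p in case.must_exclude if p.lower() in text]
    return problems


def run_eval(agent: Callable[[str], str], cases: List[PolicyCase]) -> Dict[str, List[str]]:
    """Run every case through the agent and collect failures keyed by question."""
    failures: Dict[str, List[str]] = {}
    for case in cases:
        problems = check_answer(case, agent(case.question))
        if problems:
            failures[case.question] = problems
    return failures


# Illustrative policy wording only (an assumption made for this example).
CASES = [
    PolicyCase(
        question="Can I claim a bereavement fare after I have travelled?",
        must_include=("before travel",),
        must_exclude=("within 90 days",),
    )
]


def good_agent(question: str) -> str:
    """Stub agent whose answer matches the illustrative policy."""
    return "Bereavement fares must be requested before travel."


def bad_agent(question: str) -> str:
    """Stub agent that invents a retroactive claim window."""
    return "Yes, submit your claim within 90 days of the ticket date."
```

In a deployment pipeline, a non-empty result from `run_eval` would block the release, so the first signal is a failed build and not a customer.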

The second is observability.<br>You cannot fix what you cannot see, and most agents run blind.<br>Klarna announced in early 2024 that its AI assistant was doing the work of 700 agents and resolving tickets in under two minutes.<br>By 2025 the company was rehiring people, with its CEO conceding that the focus on efficiency had produced lower quality that was not sustainable.<br>The dashboards that showed resolution rate and handle time were real.<br>They were also averages, and an average hides the distribution.<br>The hard tickets, the emotional ones, the ones that decide whether a customer stays, were degrading where no metric was pointed.<br>Phillip Carter of Honeycomb describes LLMs as nondeterministic black boxes used in ways you cannot predict in advance, and says that if you are responsible for a product's behavior in production, that should scare you.<br>The teams that keep their agents running treat them as distributed systems and instrument every step.
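The point about averages can be made concrete with a small sketch. The ticket records, the `segment` field, and the numbers below are invented for illustration and are not Klarna's data. The only claim is arithmetic: a healthy overall rate can coexist with a failing segment.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def overall_rate(tickets: List[dict]) -> float:
    """Resolution rate across all tickets, the number a headline dashboard shows."""
    return mean(1.0 if t["resolved"] else 0.0 for t in tickets)


def rate_by_segment(tickets: List[dict]) -> Dict[str, float]:
    """Resolution rate per segment, which exposes what the overall rate hides."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for t in tickets:
        grouped[t["segment"]].append(1.0 if t["resolved"] else 0.0)
    return {segment: mean(values) for segment, values in grouped.items()}


# Synthetic data: 90 easy tickets, all resolved; 10 hard tickets, 2 resolved.
TICKETS = (
    [{"segment": "easy", "resolved": True} for _ in range(90)]
    + [{"segment": "hard", "resolved": i < 2} for i in range(10)]
)

if __name__ == "__main__":
    print("overall:", overall_rate(TICKETS))
    print("by segment:", rate_by_segment(TICKETS))
```

With this made-up data the overall rate is 92 resolved out of 100, while the hard segment resolves 2 out of 10. A dashboard that tracks only the first number would never show the second.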

The third is reversibility.<br>The July database deletion is the clean example, but it has an older twin.<br>In 2012, Knight Capital deployed new code to seven of eight servers, reactivated dormant logic on the eighth, and lost over 460 million dollars in 45 minutes.<br>There was no automated post-deployment check and no business-layer kill switch.<br>The lesson is the same across thirteen years and a change of technology: irreversible action at machine speed, with no way to stop it and no one watching the right number, is a system designed to fail expensively.<br>Reversibility is not a feature you add later.<br>It is rollback, idempotent tool calls, bounded retries, and a gate in front of anything that cannot be undone.
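Three of those pieces (idempotent tool calls, bounded retries, and a gate in front of irreversible actions) fit in one small wrapper. This is a sketch under stated assumptions, not a production design: `ToolRunner`, `TransientError`, `ApprovalRequired`, and the `approve` callback are all hypothetical names, and the idempotency cache lives in memory, where a real system would need a durable store.

```python
import time
from typing import Any, Callable, Dict


class TransientError(Exception):
    """A failure worth retrying, such as a timeout (hypothetical marker type)."""


class ApprovalRequired(Exception):
    """Raised when an irreversible action has not been approved."""


class ToolRunner:
    """Wraps tool calls with an idempotency key, bounded retries, and an approval gate."""

    def __init__(
        self,
        approve: Callable[[str], bool] = lambda key: False,  # deny by default
        max_attempts: int = 3,
        base_delay: float = 0.1,
    ) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._approve = approve
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._results: Dict[str, Any] = {}  # idempotency key -> stored result

    def call(self, key: str, fn: Callable[[], Any], *, irreversible: bool = False) -> Any:
        # Idempotency: a key that already succeeded returns its stored result.
        if key in self._results:
            return self._results[key]
        # Gate: irreversible actions never run without explicit approval.
        if irreversible and not self._approve(key):
            raise ApprovalRequired(key)
        # Bounded retries with exponential backoff; the last failure propagates.
        for attempt in range(1, self._max_attempts + 1):
            try:
                result = fn()
                break
            except TransientError:
                if attempt == self._max_attempts:
                    raise
                time.sleep(self._base_delay * 2 ** (attempt - 1))
        self._results[key] = result
        return result
```

The gate denies by default, so a destructive call has to be approved and cannot simply be requested. Retrying is only safe when `fn` itself tolerates being run more than once, which is why the idempotency key and the retry loop belong together.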

The fourth is autonomy boundaries.<br>An agent should...
