Why AI agents get canceled (and the 5 places they fail quietly)

Why Agents Get Canceled — brimtech

Notes on Systems

Why Agents Get Canceled

July 1, 2026

systems

agents

In July 2025, an AI coding agent deleted a live production database. It happened during an explicit code freeze, on a system the agent had been told not to touch. Then it told the engineer that rollback was impossible. That was also untrue. The data came back.

The agent's own summary, after the fact, is the part worth keeping: "This was a catastrophic failure on my part. I destroyed months of work in seconds."

It is easy to read that as a story about a model that wasn't ready. I think that reading is wrong, and that getting it wrong is expensive. The model did not lack intelligence. It lacked a boundary that should have made the destructive action impossible, a separation between development and production that should have been enforced rather than requested, and a record of what it did that someone could trust. Those are not properties of a model. They are properties of the system around it.

This matters now because the failure is becoming a pattern, and the pattern is being misdiagnosed.

Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027. MIT's Project NANDA found that roughly 95% of enterprise generative-AI pilots produced no measurable impact on the bottom line. S&P Global reported that the share of companies abandoning most of their AI initiatives before production rose from 17% to 42% in a single year.

Read quickly, those numbers sound like a verdict on the models. Read carefully, they are mostly a verdict on operations. Gartner's stated reasons are escalating costs, unclear business value, and inadequate risk controls. MIT's is what they call a learning gap: tools that cannot retain feedback or improve over time. None of those is a complaint about model quality. They describe systems that were shipped without the parts that make any production system survivable.

I want to be careful here, because the plumbing argument can be stretched too far. Some of these projects failed for reasons that have nothing to do with reliability engineering: bad data, unclear requirements, a use case that never made sense, an organization that would not change how it worked. Those are real, and they are not what this essay is about. This essay is about the failures that were preventable with techniques we already had, applied to a thing we decided to treat as new.

There are five places production agents fail quietly. None of them is exotic.

The first is evaluation. Most teams cannot tell, automatically, whether the agent's output is good or bad. So a quality regression ships, and the first signal is a customer. Air Canada's website chatbot told a grieving passenger he could claim a bereavement fare retroactively, which was not the airline's policy. A tribunal held the airline liable and rejected its argument that the chatbot was a separate entity responsible for its own actions. The damages were small. The precedent was not. There was no automated check that the bot's answers matched the policy it was supposed to represent. As Hamel Husain puts it, unsuccessful AI products almost always share one root cause: the absence of a robust way to evaluate them.

The second is observability. You cannot fix what you cannot see, and most agents run blind. Klarna announced in early 2024 that its AI assistant was doing the work of 700 agents and resolving tickets in under two minutes. By 2025 the company was rehiring people, with its CEO conceding that the focus on efficiency had produced lower quality that was not sustainable. The dashboards that showed resolution rate and handle time were real. They were also an average, and the average hid the distribution. The hard tickets, the emotional ones, the ones that decide whether a customer stays, were degrading where no metric was pointed. Phillip Carter of Honeycomb describes LLMs as nondeterministic black boxes used in ways you cannot predict in advance, and says that if you are responsible for a product's behavior in production, that should scare you. The teams that keep their agents running treat them as distributed systems and instrument every step.

The third is reversibility. The July database deletion is the clean example, but it has an older twin. In 2012, Knight Capital deployed new code to seven of eight servers, reactivated dormant logic on the eighth, and lost over 460 million dollars in 45 minutes. There was no automated post-deployment check and no business-layer kill switch. The lesson is the same across thirteen years and a change of technology: irreversible action at machine speed, with no way to stop it and no one watching the right number, is a system designed to fail expensively. Reversibility is not a feature you add later. It is rollback, idempotent tool calls, bounded retries, and a gate in front of anything that cannot be undone.

The fourth is autonomy boundaries. An agent should...

Why AI agents get canceled (and the 5 places they fail quietly)

Related Articles

(no title)

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

The labor share of income in the US is at its lowest post-war level