Do AI agents actually work yet? What the benchmarks show · Okane Land
The Study · Explainer
AI agents finish a third of the job, and the math says why
the editors · Jun 29, 2026 · 11 min read · researched
The demos are real and the leaderboards are climbing fast. Turned loose on real multi-step work, though, the best agents finish about a third of it, and the reason is arithmetic: reliability compounds downward. Here is what the benchmarks show, the half-life math behind it, and where agents actually pay today.
Turn the best AI agent on the market loose inside a realistic software company, give it the chat tool, the project board, the code repo, and 175 ordinary office tasks, and it finishes about 30% of them on its own. That is not an old model and it is not a typo. It is Gemini 2.5 Pro, one of the strongest agents available, on Carnegie Mellon’s TheAgentCompany benchmark, scoring 30.3% fully autonomous and creeping to 39% only when you hand out partial credit for getting part-way. The rest it botched or left half-done.
Hold that next to the demos. Every agent launch shows a clean run: the agent books the trip, files the expense, fixes the bug, all by itself, and the demo is real. It is also a sample size of one. A business does not run on one good run; it runs on the thousandth, and the distance between “worked once on stage” and “works every time, unattended” is the entire question. The short answer to “do AI agents work yet” is this: on narrow, short, supervised tasks, genuinely yes. On the open-ended multi-step work people keep promising they will automate, not reliably, not yet, and the reason is arithmetic rather than a bug that gets patched next month.
The short version
Agents earn their keep inside tight lines and lose it the moment you widen them. If you are putting one to work to make money:
Keep the scope narrow and the task short. Reliability falls off a cliff as a job gets longer, and the data below is brutal on this. One tool, one well-defined job, a handful of steps, is where agents already pay.
Put a human at the checkpoints. Not watching every token, but approving the few moves that spend money or cannot be undone. The agent drafts, you commit.
Make failure cheap and reversible. Point an agent at a sandbox, a draft, a proposal, never a production database or a live send. The headline agent disasters were each one irreversible action away from fine.
Add a retry and a checker. A second pass that verifies the first one’s work buys back a lot of the lost reliability. Bare single-shot autonomy is the weakest setup there is.
Do not buy the leaderboard. The most eye-popping scores come from bespoke multi-model rigs you cannot actually purchase, not from the agent you would deploy.
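The retry-and-checker advice above can be sketched in a few lines. This is a minimal illustration, not any vendor's agent framework: `run_with_checker`, `draft_total`, and `total_is_right` are hypothetical names, and the 70% success rate of the stand-in "agent" is invented for the example.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_checker(
    attempt: Callable[[], T],
    check: Callable[[T], bool],
    max_attempts: int = 3,
) -> Optional[T]:
    """Run `attempt` until `check` accepts the result or attempts run out.

    Returns the accepted result, or None so the caller can escalate to a human.
    """
    for _ in range(max_attempts):
        result = attempt()
        if check(result):
            return result
    return None


# Hypothetical stand-ins: an "agent" that drafts the right total about 70% of
# the time, and a checker that recomputes the total independently.
line_items = [120.0, 45.5, 9.99]
rng = random.Random(0)


def draft_total() -> float:
    correct = sum(line_items)
    return correct if rng.random() < 0.7 else correct + 10.0


def total_is_right(value: float) -> bool:
    return abs(value - sum(line_items)) < 1e-9


outcome = run_with_checker(draft_total, total_is_right, max_attempts=3)
print("accepted" if outcome is not None else "escalate to a human", outcome)
```

If the checker were perfect and the attempts independent, three tries at 70% each would succeed with probability 1 - 0.3^3, or about 97%. Real checkers miss things and real attempts tend to fail in correlated ways, so treat that figure as a ceiling rather than a forecast.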
Everything after this is the evidence, and the math that says why.
Turned loose, they fail most of the job
That 30% is not an outlier. Salesforce built its own benchmark, CRMArena-Pro, to see how agents handle real sales and service work, the exact thing it sells agents to do. On single-step tasks the best model cleared about 58%. The moment the task became a normal back-and-forth where the agent had to ask for a missing detail, that fell to roughly 35%. The same agents almost never refused to hand over confidential data unless they were specifically told to be careful, and telling them to be careful dragged completion down further. This is the vendor’s own evidence against its own pitch, which makes it hard to wave away.
The deeper problem is not the average, it is the consistency. Sierra’s tau-bench sits an agent in a customer-service seat and asks it to follow company policy through a multi-step chat. The best agent it tested got the average retail task right about 61% of the time, which sounds usable. Then the researchers asked a sharper question: can it get the same task right eight times in a row, once for each of eight customers? The share it nailed every single time dropped below 25%. An agent that works two times in three is a fine demo and a poor employee, because the customer who draws the third outcome is a refund, a chargeback, or a complaint.
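The gap between "right on average" and "right every time" is easy to reproduce with made-up numbers. The sketch below assumes each task is run a fixed number of times and scores the chance that k runs drawn from those trials all succeeded. The function name, the combinatorial estimator, and the trial counts are illustrative assumptions, not tau-bench data or Sierra's code.

```python
from math import comb


def all_k_succeed(successes: int, trials: int, k: int) -> float:
    """Chance that k runs drawn without replacement from `trials` all succeeded."""
    if k > trials:
        raise ValueError("k cannot exceed the number of trials")
    return comb(successes, k) / comb(trials, k)


# Hypothetical results: successes out of 8 runs for each of five tasks.
runs_per_task = 8
successes_per_task = [8, 8, 6, 4, 0]

average = sum(successes_per_task) / (runs_per_task * len(successes_per_task))
every_time = sum(
    all_k_succeed(c, runs_per_task, 8) for c in successes_per_task
) / len(successes_per_task)

print(f"average success rate: {average:.0%}")   # 65%
print(f"passed all 8 runs:    {every_time:.0%}")  # 40%
```

In this toy data the agent looks 65% reliable on average, yet only two of the five tasks pass all eight runs. A task that succeeds six times in eight counts toward the average and counts for nothing on the every-time measure.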
The math nobody demos
Here is why a healthy per-task score still does not add up to an autonomous worker, and it is just multiplication. Say an agent is 95% reliable on each individual step, which is better than most manage. Chain 20 of those steps into one task, the kind of thing “book my travel and file the expenses” actually involves, and the odds it gets the whole task right are 0.95 to the twentieth power, or about 36%. At ten steps it is still only 60%. Reliability compounds, and it compounds downward, because a long task is a chain of subtasks where failing any one fails the whole thing.
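The multiplication is short enough to check yourself. This sketch assumes every step succeeds independently with the same probability, which is the simplification the paragraph above uses. The function names are illustrative.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed for the whole chain to hit a target success rate."""
    return target ** (1 / steps)


for n in (1, 5, 10, 20, 50):
    print(f"{n:>2} steps at 95% each -> {chain_success(0.95, n):.0%} end to end")

print(f"90% overall across 20 steps needs {required_per_step(0.90, 20):.2%} per step")
```

Running it the other way is the sobering part: for a 20-step task to succeed nine times in ten, each step has to be right roughly 99.5% of the time.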
Oxford’s Toby Ord gave this its name in a 2025 paper, Is there a half-life for the success rates of AI agents? Working from the benchmark data, he found agent success decays exponentially with task length, as cleanly as if each agent ran a fixed risk of...