Where AI agents pay off

ricokahler · 1 point · 0 comments

Where AI Agents Actually Pay Off | rico.codes

Posted June 4, 2026

I am starting to get real leverage from AI agents.

Not theoretical leverage. Not "look, the chatbot wrote a function" leverage. I mean the kind where a messy voice note turns into a draft, a repo change, a test, a pull request, a live fix, a follow-up task, and a breadcrumb that gives the next agent more context.

That leverage is exciting. It is also a little cursed.

The cursed part is not that the models are secretly alive or that software engineers are all immediately obsolete. The cursed part is more boring and more important: the economics are starting to work in weird places, especially for individuals and very small teams, and they do not work everywhere. The window is small. The workflow changes are nontrivial. The token bill can get gross fast. And if you do not build the surrounding system, agents can easily become an expensive way to generate unfinishedness.

This is where I think most agent discourse gets a little too smooth. People ask "is AI faster?" as if there is one answer.

There is not.

Sometimes it is slower. Sometimes the model churns. Sometimes the first answer is plausible but wrong. Sometimes the agent burns twenty minutes going in the wrong direction.

But the interesting question is not whether one agent is always faster than one human on one task. The interesting question is:

What happens when a human can specify, run, review, and improve many bounded execution loops in parallel?

That is where the ROI starts showing up.

It is also where the danger starts showing up.
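To make "bounded execution loop" concrete, here is a minimal sketch of one possible shape. Everything in it is an assumption for illustration: `run_bounded_loop`, `run_in_parallel`, and the fake agent and evaluator are hypothetical names, not part of any real agent harness. The point is only that each loop has an attempt budget and an evaluator, and that several loops can run side by side.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    task: str
    passed: bool
    attempts: int


def run_bounded_loop(
    task: str,
    attempt: Callable[[str, int], str],
    evaluate: Callable[[str], bool],
    max_attempts: int = 3,
) -> LoopResult:
    """Retry a task until the evaluator passes or the attempt budget runs out."""
    for n in range(1, max_attempts + 1):
        output = attempt(task, n)
        if evaluate(output):
            return LoopResult(task, True, n)
    # Budget exhausted: return the failure to the human instead of churning.
    return LoopResult(task, False, max_attempts)


def run_in_parallel(tasks, attempt, evaluate, max_attempts=3, workers=4):
    """Run one bounded loop per task concurrently, preserving task order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_bounded_loop, t, attempt, evaluate, max_attempts)
            for t in tasks
        ]
        return [f.result() for f in futures]


# Hypothetical stand-in for an agent: "fix-bug" succeeds on the second try,
# "vague-idea" never succeeds, everything else succeeds immediately.
def fake_attempt(task: str, n: int) -> str:
    if task == "fix-bug":
        return "ok" if n >= 2 else "wrong"
    if task == "vague-idea":
        return "wrong"
    return "ok"


# Hypothetical stand-in for an evaluator (a test run, a build, a review).
def fake_evaluate(output: str) -> bool:
    return output == "ok"


results = run_in_parallel(
    ["write-test", "fix-bug", "vague-idea"], fake_attempt, fake_evaluate
)
for r in results:
    print(r)
```

The loop that never passes is the interesting one: it stops after its budget and surfaces as a failure to review, rather than silently consuming more tokens.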

George Hotz wrote the sharp negative version of this in "The Eternal Sloptember". His argument, as I read it, is not just "AI code bad." It is that agent output frontloads the impressive part, leaves the hard polish and coherence work to the human, and produces artifacts that are broken in ways old quality proxies do not catch anymore.

I do not fully buy the permanent claim that agents cannot program. I do buy the organizational warning. If your feedback loops are slow and your average worker is not carefully reading and error-correcting the output, agents can raise the volume of mediocre work faster than they raise the quality of good work.

That distinction matters. The question is not "agents: yes or no?" The question is "who can absorb the leverage without degrading their own system?"

The ROI Is A System Property

The useful unit is not "the model."

The useful unit is the whole system:

Capability = model x harness x tools x environment x evaluator
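Taken literally, the multiplication has a consequence worth spelling out: a weak factor drags the whole product down, and a zero factor zeroes it. The sketch below is an illustration under loud assumptions. The scores are made-up numbers on an assumed 0 to 1 scale, and `capability` is a hypothetical helper, not a measurement method.

```python
from math import prod


def capability(factors: dict[str, float]) -> float:
    """Multiply per-component scores; each is assumed to lie in [0, 1]."""
    return prod(factors.values())


# Made-up scores for illustration only.
balanced = {
    "model": 0.7, "harness": 0.7, "tools": 0.7,
    "environment": 0.7, "evaluator": 0.7,
}
strong_model_weak_environment = {
    "model": 0.95, "harness": 0.8, "tools": 0.8,
    "environment": 0.2, "evaluator": 0.8,
}
strong_model_no_evaluator = {
    "model": 0.95, "harness": 0.8, "tools": 0.8,
    "environment": 0.8, "evaluator": 0.0,
}

for name, factors in [
    ("balanced", balanced),
    ("strong model, weak environment", strong_model_weak_environment),
    ("strong model, no evaluator", strong_model_no_evaluator),
]:
    print(f"{name}: {capability(factors):.3f}")
```

Under these assumed numbers, the evenly mediocre system outscores the one with the best model and an illegible environment, and the system with no evaluator scores zero regardless of the model.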

The model matters. Obviously. A stronger model listens better, repairs better, and survives ambiguity better. GPT-5.5, in particular, has felt like a genuinely good foundational engineering model in my current workflow. It is often good enough that I can hand it a real codebase, a weird constraint, and a fuzzy product taste problem, then get back something I can review instead of something I have to babysit from first principles.

The annoying wrinkle is that models are not good in one global way. Some cloud/chat models feel much better at one-shot apps, UX exploration, visual design, and frontend taste. Codex/GPT-5.5 feels more steerable for deep repo engineering, but it can be pretty rough by default on product polish. That is not a contradiction. It is routing. Different tasks want different model/harness/tool combinations.
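In its simplest form, routing can be a lookup from task kind to a model/harness pair. The sketch below assumes exactly that. The task kinds, model labels, and harness labels are hypothetical placeholders, not real product names or a recommendation of specific tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str    # hypothetical label, not a real model identifier
    harness: str  # hypothetical label for how the model is run


# Assumed task kinds, loosely following the split described above.
ROUTES = {
    "repo_engineering": Route("steerable-repo-model", "terminal-agent"),
    "ux_exploration": Route("design-leaning-model", "chat"),
    "one_shot_app": Route("design-leaning-model", "chat"),
}

# Fallback when a task kind has not been classified yet.
DEFAULT_ROUTE = Route("steerable-repo-model", "terminal-agent")


def route(task_kind: str) -> Route:
    """Pick a model/harness pair for a task kind, falling back to a default."""
    return ROUTES.get(task_kind, DEFAULT_ROUTE)


print(route("ux_exploration"))
print(route("database_migration"))
```

A table like this is deliberately dumb. Its value is that the routing decision becomes explicit and editable instead of living in habit.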

But the model is not the product.

The harness matters. Can it read the repo? Can it run tests? Can it browse current docs? Can it keep a plan? Can it spawn parallel work safely? Can it preserve local changes it did not make? Can it say clearly when it is blocked?

The tools matter. A model with a terminal, browser, GitHub access, docs, image inspection, and a real test suite is a different creature from the same model in a textbox. Tool access changes the shape of cognition because the agent can externalize uncertainty into the world: read the file, run the command, inspect the screenshot, check the deployed page.

The environment matters. A legible repo is agent fuel. Good scripts are agent fuel. Clear boundaries are agent fuel. Stable design primitives, typed connectors, preview/apply workflows, and boring test commands are all forms of intelligence that do not live in the model weights.

And the evaluator matters most of all. A task becomes delegable when there is a way to tell whether it worked.

Typecheck. Test. Build. Screenshot. Read back the external system. Ask a human to review a tight diff. Run an eval. Compare against a rubric. Verify the live URL. Whatever. Without an evaluator, the agent is not really operating. It is describing completion instead of proving it.
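For the command-line checks in that list, the smallest useful evaluator is a script that runs them in order and treats exit code 0 as a pass. The sketch below assumes that shape. `run_check` and `evaluate` are hypothetical helpers, and the commands are placeholders that shell out to the Python interpreter so the example runs anywhere. A real project would substitute its own typecheck, test, and build commands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def run_check(name: str, command: list[str], timeout: float = 300.0) -> CheckResult:
    """Run one command; exit code 0 counts as a pass."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return CheckResult(name, False, f"could not complete: {exc}")
    # Keep only the tail of the output so a report stays readable.
    tail = (proc.stdout + proc.stderr).strip()[-500:]
    return CheckResult(name, proc.returncode == 0, tail)


def evaluate(checks: dict[str, list[str]]) -> list[CheckResult]:
    """Run checks in order and stop at the first failure."""
    results = []
    for name, command in checks.items():
        result = run_check(name, command)
        results.append(result)
        if not result.passed:
            break
    return results


# Placeholder commands (assumptions): the "test" step is rigged to fail.
checks = {
    "typecheck": [sys.executable, "-c", "print('types ok')"],
    "test": [sys.executable, "-c", "import sys; sys.exit(1)"],
    "build": [sys.executable, "-c", "print('built')"],
}

results = evaluate(checks)
done = len(results) == len(checks) and all(r.passed for r in results)
for r in results:
    print(r.name, "PASS" if r.passed else "FAIL")
print("done:", done)
```

The `done` flag is computed from what actually ran, which is the difference between proving completion and describing it. The non-scriptable checks in the list, such as a human reviewing a tight diff, still need a person.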

Manual Testing Is Underrated

The best agent workflows I have found are not the most autonomous ones. They are the ones with the tightest feedback loops.

Manual testing is underrated here. So is manual tasking.

People...
