We Gave the Robot a Team | Seyhan Dzhamur
14/06/2026 | 8 min read | #ai #agents #developer tools
A year or two ago the message was simple: use more AI, or get left behind. Every standup, every newsletter, every vendor deck said the same thing. Management heard it loud and clear, and "are you using AI for this?" quietly became a normal question in code review. Some of us were nudged. Some of us were told. Either way, AI moved from a tab we sometimes opened to something wired into the editor, the pipeline, and half the tickets.
Fine. A lot of it genuinely helps, and I'm not going to pretend otherwise.
But now the part that wasn't on the slide is showing up: the invoice.
A chat window is cheap. An agent is not.
Here's what a lot of people missed. Asking a model a question in a chat box is one call. You type, it answers, you move on. An agent doesn't work like that. One request fans out into a small crowd of calls. The agent plans, searches the codebase, edits a few files, reviews its own work, notices it got something wrong, retries, and every so often decides the job needs a couple of helper agents to finish. Apparently one robot wasn't expensive enough, so we gave it a team.
This isn't a hunch. Anthropic, who build the things, put numbers on it: agents use roughly 4x more tokens than a normal chat, and multi-agent setups around 15x more. Same question, fifteen times the bill. You don't feel it on one task. You feel it at the end of the month, when finance walks over wearing a particular kind of face.
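To put the multipliers into money, here is a back-of-the-envelope sketch in Python. Only the 4x and 15x factors come from the figures above. The tokens per chat request, the blended price, and the request volume are made-up numbers for illustration, so swap in your own.

```python
# Rough monthly cost under the token multipliers quoted above.
# Everything except the 4x / 15x factors is an assumed, illustrative number.

CHAT_TOKENS_PER_REQUEST = 2_000     # assumption: tokens used by one plain chat request
PRICE_PER_MILLION_TOKENS = 3.00     # assumption: blended USD price per million tokens
REQUESTS_PER_MONTH = 50 * 22 * 20   # assumption: 50 engineers, 22 working days, 20 requests a day

MULTIPLIERS = {"chat": 1, "agent": 4, "multi-agent": 15}


def monthly_cost(multiplier: float) -> float:
    """USD per month for a given token multiplier relative to plain chat."""
    tokens = CHAT_TOKENS_PER_REQUEST * multiplier * REQUESTS_PER_MONTH
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


for mode, multiplier in MULTIPLIERS.items():
    print(f"{mode:<12} ${monthly_cost(multiplier):>10,.2f}")
```

With these invented inputs the same workload costs $132 a month as chat, $528 as a single agent, and $1,980 as a multi-agent setup. The absolute figures mean nothing. The ratio is the point.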
And they have walked over. Uber reportedly capped engineers at $1,500 a month per coding tool after burning through its annual AI budget in four months. Four months. That's not a rounding error, that's a pattern.
Now, the easy assumption is that tokens are getting more expensive. They aren't. If anything it's the opposite. By Stanford's AI Index, the price to run a GPT-3.5-class model fell from about $20 per million tokens in late 2022 to roughly $0.07 two years later. Per token, prices are in freefall. The bills go up anyway, because our token use is growing faster than the price is dropping. Enterprise spend on these models roughly doubled in just the first half of 2025. Cheaper units, much bigger appetite. The cloud providers win either way.
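The arithmetic behind "cheaper units, bigger bill" is short enough to write down. The two prices below are the ones quoted above. The usage growth factors are hypothetical, picked only to show where the break-even sits, and the sketch assumes you stay on the same class of model, which real teams often don't.

```python
# Break-even usage growth for the price drop quoted above.

OLD_PRICE = 20.00   # USD per million tokens, late 2022 (figure from the text)
NEW_PRICE = 0.07    # USD per million tokens, two years later (figure from the text)

price_drop_factor = OLD_PRICE / NEW_PRICE   # how many times cheaper one token became


def bill_ratio(usage_growth: float) -> float:
    """New bill divided by old bill when token usage grows `usage_growth` times."""
    return usage_growth / price_drop_factor


print(f"price per token fell about {price_drop_factor:.0f}x")
for growth in (100, 286, 1_000):   # assumption: hypothetical usage growth factors
    print(f"usage x{growth:>5}: bill x{bill_ratio(growth):.2f}")
```

A per-token price that falls about 286x only keeps the bill flat if usage grows by less than that. Grow usage 1,000x, which is not hard once agents are doing the calling, and the bill is 3.5 times what it was.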
Using the best model for everything is just waste
So here's where I landed, and it isn't clever. Using the most expensive flagship model for every single action is like sending your most senior architect to rename a variable. Sure, they can do it. They'll do it beautifully. But you're paying architect rates to move a brick, and there are a lot of bricks in a building.
Most of what an agent does in a day isn't hard. It's predictable. Write the boilerplate, add the test, update the doc, summarise the file, run the search. None of that needs the smartest model on the market. It needs a model that gets it right reliably and costs a fraction.
And I want to be clear about one thing, because it's where people get this wrong: this is not "always pick the cheapest option." Cheap-but-wrong is the most expensive thing there is, since a failed task means retries, cleanup, and you doing it by hand anyway. The goal is the least expensive model that still does the job properly. That's a completely different sentence.
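One way to make "least expensive model that still does the job" concrete is to price in the failures. The sketch below is a toy expected-cost model, not a benchmark: the three model names, their per-attempt costs, their success rates, and the $5 cost of doing the task by hand are all assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str                # hypothetical label, not a real product
    cost_per_attempt: float  # assumed USD per attempt at the task
    success_rate: float      # assumed probability one attempt does the job properly


def expected_cost(model: Model, max_attempts: int = 3, fallback_cost: float = 5.00) -> float:
    """Expected USD per task: pay for each attempt until one succeeds,
    and pay `fallback_cost` (doing it by hand) if every attempt fails."""
    total = 0.0
    p_reached = 1.0  # probability that we get as far as this attempt
    for _ in range(max_attempts):
        total += p_reached * model.cost_per_attempt
        p_reached *= 1.0 - model.success_rate
    return total + p_reached * fallback_cost


CANDIDATES = [
    Model("tiny", cost_per_attempt=0.002, success_rate=0.40),
    Model("mid", cost_per_attempt=0.020, success_rate=0.97),
    Model("flagship", cost_per_attempt=0.300, success_rate=0.99),
]

best = min(CANDIDATES, key=expected_cost)
for m in CANDIDATES:
    print(f"{m.name:<9} expected ${expected_cost(m):.4f} per task")
print("pick:", best.name)
```

With these made-up numbers the cheapest model per attempt ends up the most expensive per task, because it fails often enough that the manual fallback dominates. The middle option wins, and the flagship lands in between.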
My planner-and-worker setup
What I've been experimenting with looks roughly like this:
The strong model plans. Architecture, the tricky reasoning, the decisions that are painful to undo. That's where it earns its price tag.
That plan gets broken into smaller, clear tasks. Deliberately boring ones.
Those tasks go to faster, cheaper models.
Specialised agents handle the predictable stuff: implementation, tests, docs, research, code analysis. Each one decent at one job.
The strong model only comes back for a real decision, a conflict, or the final review that's actually worth its rate.
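The shape of the steps above fits in a few lines of Python. This is a toy to show the routing, not something to ship: `strong_model` and `cheap_model` are stub functions standing in for whatever clients you actually use, and the planning step is hard-coded where a real one would ask the strong model to split the request.

```python
from dataclasses import dataclass
from typing import Callable

# A "model" here is just a function from prompt to text. These are stubs.
ModelFn = Callable[[str], str]


@dataclass
class Task:
    kind: str          # e.g. "implementation", "tests", "docs"
    description: str


def strong_model(prompt: str) -> str:   # hypothetical flagship model
    return f"[strong] {prompt}"


def cheap_model(prompt: str) -> str:    # hypothetical fast, cheap model
    return f"[cheap] {prompt}"


def plan(request: str) -> list[Task]:
    """Stand-in for the planning call; the split is hard-coded here."""
    return [
        Task("implementation", f"implement: {request}"),
        Task("tests", f"write tests for: {request}"),
        Task("docs", f"update docs for: {request}"),
    ]


# Specialised workers, one per predictable kind of task.
WORKERS: dict[str, ModelFn] = {
    "implementation": cheap_model,
    "tests": cheap_model,
    "docs": cheap_model,
}


def run(request: str) -> dict[str, object]:
    results = []
    for task in plan(request):
        # Anything without a dedicated worker escalates to the strong model.
        worker = WORKERS.get(task.kind, strong_model)
        results.append(worker(task.description))
    # The strong model comes back once, for the final review.
    review = strong_model("review: " + " | ".join(results))
    return {"results": results, "review": review}


out = run("add retry to the upload client")
print(out["review"])
```

In this run the strong model is called once, for the review, while the three boring tasks go to the cheap stub. A task kind that has no entry in `WORKERS` would fall through to the strong model, which is the "real decision" escape hatch.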
The flagship model behaves like a tech lead, not the intern who does everything. It decides what gets built and it checks the result. It does not need to lay every brick personally. Most of the work, honestly, shouldn't reach it at all.
One honest caveat, because none of this is free either. Wiring it up yourself (the routing, the handoffs, five models that each fail in their own special way) is real engineering work, and you can easily burn more time babysitting the orchestration than you ever save in tokens. So mostly, don't build it from scratch. The tooling is building this in, and the routing is slowly becoming somebody else's problem to maintain. The trap isn't using more than one model, it's hand-rolling a fragile contraption to shave pennies off a job one solid model would have handled. The setup earns its keep when you're running the same flows all day, every day. On a one-off script, it's a cathedral with a very small congregation.
The boring tricks that actually work
None of this is secret, and teams under cost pressure...