Observations on AI agent token consumption

Will Hackett


A new paper from researchers at Stanford, Michigan, DeepMind, All Hands, Microsoft AI and MIT is the most detailed open empirical study I’ve seen of how AI agents actually spend tokens at scale [1]. The authors run eight frontier models across 500 SWE-bench Verified tasks with four runs each, capturing full trajectory telemetry decomposed by token type, phase and action. They release the dataset alongside the paper, which is to my knowledge the most granular public corpus of agentic trajectories currently available.

The paper is rigorous, careful about what it claims and puts hard numbers on questions that have until now only been answered with anecdotes. I’d recommend reading it in full.

What follows is a walk through four of the paper’s observations, interleaved with what we are seeing at Flowstate, where the same patterns surface in customer environments. We sit in the request path between the user and the AI provider, which means we observe the same kind of trajectories the paper analyses, but in production, across a much broader set of AI tools than SWE-bench covers.

The two sets of observations are remarkably close. The researchers measured these patterns on a benchmark; we see them on customer devices. The agreement between the two is what makes this paper so useful for anyone trying to actually manage this spend.

Input tokens dominate agentic spend

The paper’s first finding is that agentic coding consumes around 1,000 times more tokens than equivalent code-chat or code-reasoning tasks, with an input-to-output ratio of roughly 153:1 (against 1.33:1 for chat and 0.16:1 for reasoning) [2].

The reason is structural. Agentic workflows accumulate context across rounds, and the same content is fed back into the model on every single turn. Token caching helps at the margins, but the sheer volume of accumulated context dominates the cost.
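The accumulation described above can be illustrated with a minimal sketch. It assumes the simplest possible setup: every turn re-sends the full history, nothing is cached, and the per-turn token counts are made-up round numbers rather than figures from the paper. The function name and parameters are hypothetical.

```python
def cumulative_tokens(turns, system_tokens=0):
    """Total input and output tokens billed over a multi-turn session.

    turns: list of (new_input_tokens, output_tokens), one pair per turn.
    Assumes the whole accumulated context is re-sent on every turn
    and that no prompt caching applies (a deliberate simplification).
    """
    context = system_tokens
    total_in = 0
    total_out = 0
    for new_in, out in turns:
        context += new_in      # new user or tool content joins the history
        total_in += context    # the full history is sent as input this turn
        total_out += out
        context += out         # the model's reply becomes part of the history
    return total_in, total_out


# Illustrative numbers only: 20 turns, 2,000 new input tokens and
# 100 output tokens per turn, on top of a 1,000-token system prompt.
total_in, total_out = cumulative_tokens([(2000, 100)] * 20, system_tokens=1000)
print(total_in, total_out, total_in / total_out)
```

With these assumed numbers the session bills 459,000 input tokens against 2,000 output tokens, even though only 41,000 tokens of distinct input were ever supplied. Input grows roughly with the square of the number of turns, while output grows linearly.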

This is the exact pattern we see in non-agentic AI usage as well. Chat-style usage of Claude, ChatGPT and similar tools follows the same shape because users continue conversations across days rather than starting fresh sessions with explicit context. One customer described it to us this way:

“We think they’re creating PowerPoints, and then they’re like, ‘change this word on slide three’, and then they’re just continuing to generate these really large documents.”

That is the paper’s finding in human form. A chat session that should have been a fresh prompt becomes a thread that re-pays for its entire history on every turn. The user thinks they are making one small edit. The model is being asked to re-process the entire document. The vendor charges accordingly.

The implication is that a massive share of controllable AI cost sits upstream of the model. Better prompts. Fresh sessions. Explicit context provided once, rather than constructed iteratively over an afternoon. The agent’s behaviour is largely a consequence of how it was set up.

Model choice produces order-of-magnitude cost differences

On the 230 SWE-bench tasks that every tested model successfully solved, Kimi-K2 and Claude Sonnet 4.5 used on average 1.5 million more tokens than GPT-5 [3]. Same problems, same correct answers, vastly different token appetites.

The paper is careful to rule out the obvious explanation: the cost gap persists on both the shared-success subset and the shared-failure subset. The more expensive models were not tackling harder problems. They were simply spending more tokens on the same problems.
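The shared-subset comparison is straightforward to reproduce on any trajectory log. The sketch below is not the paper's analysis code; it assumes a hypothetical data layout (a dict of per-model results keyed by task ID) and uses toy numbers with placeholder model names.

```python
from statistics import mean


def shared_success_means(runs):
    """Mean token usage per model, restricted to tasks every model solved.

    runs: {model_name: {task_id: (solved, tokens)}}  (hypothetical layout)
    Returns (shared_task_ids, {model_name: mean_tokens}).
    """
    if not runs:
        return set(), {}
    # Tasks solved by each model, then the intersection across all models.
    solved_sets = [
        {task for task, (solved, _) in tasks.items() if solved}
        for tasks in runs.values()
    ]
    shared = set.intersection(*solved_sets)
    if not shared:
        return shared, {}
    means = {
        model: mean(tasks[task][1] for task in shared)
        for model, tasks in runs.items()
    }
    return shared, means


# Toy data: both models solve t1 and t2; only model_b solves t3.
runs = {
    "model_a": {"t1": (True, 100), "t2": (True, 300), "t3": (False, 900)},
    "model_b": {"t1": (True, 400), "t2": (True, 600), "t3": (True, 200)},
}
shared, means = shared_success_means(runs)
print(sorted(shared), means)
```

Restricting to the intersection removes the objection that one model only looks expensive because it attempted, or solved, harder tasks. Running the same function with the solved flag inverted would give the shared-failure comparison.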

This matches a behaviour we observe consistently. Users default to whichever model is most prominent in the UI, and “most prominent” typically means most expensive. Opus when Sonnet would have done the job. Vendors have no commercial incentive to route users toward cheaper models. From another customer conversation:

“We definitely know that people are using just all Opus. The people that are using up their tokens, they’ll continue to do that unless there’s a way to control it. We did not know there was a way to control that in Claude. I know there isn’t.”

There is a way to control it, but it doesn’t live in the vendor’s product. The natural place for it is the layer that can see the task category and route at the request level: boilerplate to the leaner model, long-form planning to the heavier one. The Stanford finding that token efficiency is a property of the model rather than the task is precisely what makes routing viable. If heavier models only burned more tokens on harder problems, routing would be useless. They don’t, so it isn’t.
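As a rough illustration of request-level routing, the sketch below picks a model from a task category before the request is forwarded. It is a hypothetical example, not a description of any vendor's or Flowstate's implementation: the category labels, the `lean-model` and `heavy-model` names, and the request shape are all assumptions.

```python
# Hypothetical category-to-model table; the names are placeholders,
# not real model identifiers.
ROUTES = {
    "boilerplate": "lean-model",
    "planning": "heavy-model",
}
DEFAULT_MODEL = "lean-model"  # assumption: unknown categories go to the cheaper tier


def route_request(request: dict) -> dict:
    """Return a copy of the request with its model chosen by task category."""
    category = request.get("category")
    model = ROUTES.get(category, DEFAULT_MODEL)
    # Build a new dict so the caller's original request is left untouched.
    return {**request, "model": model}


original = {"category": "boilerplate", "model": "heavy-model", "prompt": "Add a getter"}
routed = route_request(original)
print(routed["model"])
```

Defaulting unknown categories to the leaner model is one possible policy; a more cautious deployment might default to whatever the user originally selected.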

Token usage is highly variable and difficult to predict

The paper’s third observation is that four runs of the same model on the same task can differ by up to 30x in total token cost [4]. On average, the most expensive run on a given problem costs roughly twice the cheapest. As cost goes up, predictability goes down.
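The two statistics in this observation, the worst-case spread and the average max-to-min ratio, can be computed from repeated runs in a few lines. The sketch below assumes a hypothetical mapping from task ID to a list of per-run token costs, and the numbers are invented for illustration.

```python
from statistics import mean


def run_spread(run_costs):
    """Max-to-min cost ratio per task across repeated runs.

    run_costs: {task_id: [cost_of_run_1, cost_of_run_2, ...]}  (hypothetical layout)
    Returns (ratios_by_task, worst_ratio, mean_ratio).
    Tasks with no runs or a zero-cost run are skipped.
    """
    ratios = {
        task: max(costs) / min(costs)
        for task, costs in run_costs.items()
        if costs and min(costs) > 0
    }
    if not ratios:
        return ratios, None, None
    return ratios, max(ratios.values()), mean(ratios.values())


# Toy data: four runs per task, as in the paper's setup, with invented costs.
ratios, worst, average = run_spread({
    "t1": [100, 200, 150, 100],
    "t2": [10, 300, 20, 30],
})
print(ratios, worst, average)
```

The worst-case ratio and the mean ratio answer different questions: the first bounds how badly a single task can surprise you, the second describes the typical gap a budget has to absorb.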

More pointedly: the authors test...
