Burn less, ship more: the case for token optimization
Goodhart's Law, applied to tokens<br>For a few strange quarters, parts of the tech industry decided that the best way to measure AI adoption was to count how many tokens engineers burned. This trend was called tokenmaxxing.<br>Meta built an internal leaderboard ranking all 85,000+ employees by token consumption. Top users earned titles like "Session Immortal" and "Token Legend." In one 30-day period, Meta employees burned through 60.2 trillion tokens, a volume that would cost around $900 million at standard Anthropic API pricing. Even at enterprise discount rates, the bill likely ran to $100 million or more. A meaningful chunk of that, by all accounts, was deliberate waste.<br>Microsoft ran a similar leaderboard. Salesforce set minimum monthly token spend targets and made everyone's spend visible to their teammates.<br>The message was clear: use enough tokens, or get flagged. The results were just as predictable.<br>The Pragmatic Engineer reports this quote from a software engineer at Microsoft:<br>“I am conscious of not wanting to be seen as ‘uses too little AI,’ and I’m not ashamed to say I need to do tokenmaxxing to do this. Things I do to inflate my token usage metrics:<br>Ask AI questions about the code already in the documentation. The AI pulls up the documentation, processes it, and gives me results 10x slower, but while burning lots of tokens. I could use ‘readthedocs’ [an internal product], but then my token numbers would be lower.<br>Ask the AI to prototype a feature that I have no intention of working on. Prompt it a few more times, then throw the whole thing away.<br>Default to always using the agent, even when I know I could do the work by hand much faster. Then watch it fail.”<br>
This engineer’s behavior is the logical output of a badly designed incentive. They know exactly what they're doing, know it's wasteful, and do it anyway because the alternative is being tagged as insufficiently AI-native.<br>This isn't the first time the industry has run headfirst into Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Tokenmaxxing is the spiritual successor to "lines of code shipped," a regression to the pre-DORA era of developer productivity measurement, where what got counted was an input to the process, not an outcome of it.<br>Token leaderboards are already coming down, and this trend is hopefully in the rear-view mirror. What replaces it matters, and the case for token optimization is stronger than most teams realize.<br>Three reasons token optimization is now inevitable<br>Three structural forces make the correction from tokenmaxxing to token optimization inevitable:<br>The cost will catch up<br>Tokenmaxxing as a productivity measure might seem merely wasteful while AI providers are still subsidizing adoption. It will be fiscally irresponsible, and embarrassing to justify, once the bills start reflecting what inference actually costs.<br>AI providers are pricing tokens to drive developer adoption, not to recover their infrastructure costs. The economics of training and running large models at scale are still deeply unfavorable, and every indication is that the current pricing represents a land-grab phase, not a sustainable market rate. When prices eventually reflect real costs, teams that have built workflows around burning tokens freely will face a reckoning they haven't budgeted for.<br>Even now, CFOs are starting to see AI tooling costs appear as a significant line item without a corresponding improvement in the metrics most engineering managers care about. 
The data doesn't make a comfortable case for this level of continued spending: incidents per PR up 242%, bugs per developer up 54%, and PR revert rates flat despite record token consumption. The productivity gain is real at the individual level (more code shipped, more tasks closed), but more tokens are not buying more reliable, higher-quality software.<br>More tokens, same problems<br>The Jellyfish AI Engineering Trends data makes the point plainly: engineers with the largest token budgets produced the most pull requests, but productivity improvement didn't scale with token spend. The correlation between token consumption and output quality simply isn't there.<br>Tokenmaxxing behavior (e.g., running agents on loosely specified tasks or burning through context with throwaway prototypes) is part of the problem. But the deeper issue is structural.<br>We are trying to solve a data problem with more tokens and better models. An AI agent is only as good as the environment it operates in, and it requires precise, complete runtime data about your system to reason correctly about what broke and how to fix it.<br>When you bolt an AI agent onto an observability stack designed to give humans dashboards for assessing system health, the agent inherits all the limitations of that architecture: sampled and aggregated data, siloed context, and no correlation across the system boundaries where complex failures...