AI Has a Measurement Problem - by Luis Gardea
Luis's Substack
AI Has a Measurement Problem<br>And it's everyone's problem
Luis Gardea<br>Jun 08, 2026
This is my first blog post, so quick intro: I’m a software engineer working on growth and experimentation at Instacart. I’ve been building with AI outside of work and I’m intellectually hooked on this space. I expect to write about AI, tech, and finance, but I’ll go wherever interesting problems are. Opinions are my own. With that out of the way, here’s what I want to talk about.<br>Tokenmaxxing – the symptom
Tech companies have rapidly adopted AI tools over the last couple of years. Uber has been in the news lately, with its COO saying that he can’t draw a line from the rising usage of Claude Code and token spend to useful shipped features. They’ve been spending $500-$2,000 per month per engineer, with adoption partly driven by internal leaderboard dashboards that rank teams or engineers by token usage. Reports say they blew through their 2026 AI budget in four months.<br>Tokenmaxxing has been a live trend at other companies too. Salesforce set minimum daily spend targets and built dashboards showing what employees were spending. Meta’s Claudeonomics leaderboard was created internally by an engineer and has since been killed. Amazon had a similar leaderboard that was pulled after gaming concerns. Jensen Huang said he’d be “deeply alarmed” if a $500k engineer wasn’t burning $250k in tokens per year. And one company (obviously a very large one) reportedly spent $500m on tokens in a single month.<br>EntelligenceAI analyzed over a million pull requests across 2,400+ engineering organizations and found that only 18% of AI coding spend mapped to shipped products that reached real users. The rest went to reactive work, rework, and review friction.<br>So it’s not just that employees are gaming metrics. There’s a measurement vacuum, and the gaming is a symptom.<br>The correction is already underway: Uber and others are starting to ration access, cap spend, and pull tools. But the impact of these cuts is unmeasured too. Cutting is more rational than limitless spend, but it is still flying blind. Companies spent blindly and are now cutting blindly. The underlying problem is an inability to connect spend to value, and it recurs at every scale. Closing the measurement gap, by connecting spend to impact where possible and honestly bracketing the uncertainty where it isn’t, is where the leverage lies.<br>Why now?
In 2026, execution is cheap. When execution was the bottleneck, scarcity did your prioritization for you. AI removed the bottleneck, and now the prioritization question is exposed.<br>Cheap to try means more things get tried, at a higher false-positive rate. Whether the new exploration rate is net positive is itself unanswerable without measurement. Anthropic reports the same thing internally: an “explosion of new ideas, initiatives, tools, and simulations – far more than we have the capacity to pursue”, with human code review becoming the new bottleneck once generation scaled. Even the most advanced labs are hitting the same wall: with execution getting cheap, the constraint has shifted to deciding what’s worth doing.<br>Quality evidence is mixed, and the mix is the point. Analysis from Georgia Tech found AI-generated CVEs tripled between Q4 ’25 and Q1 ’26; Waydev found more accepted code with more rework. But Jellyfish found no statistically significant relationship between AI adoption and bug or revert rates, and METR’s RCT found experienced devs were 19% slower with AI while perceiving a speedup. The evidence is conflicting because nobody measures it cleanly.<br>The obvious half-fix, measuring tasks instead of tokens, just moves the gap up a level. You learn how many PRs were merged, maybe which tokens went to which PR, but not whether the PR mattered. So you learn whether more work is getting done, but not what impact it had.<br>The same gap, three layers up
Without measurement, every adoption decision is driven by what peers are doing rather than by demonstrated returns. It’s FOMO, wanting what high-status others want, applied to AI spend.<br>Inside a tech company, this shows up as tokenmaxxing. Boards pressure management to show AI adoption because peer companies show AI adoption, and management translates that into spend targets (or an AI-generated PR percentage). Those targets turn into leaderboards, and employees optimize the only visible metric. The whole chain runs on FOMO at every step: Goodhart’s Law meeting Girardian desire. Goodhart’s Law, “When a measure becomes a target, it ceases to be a good measure”, shows up everywhere from A/B testing to reward hacking in AI/ML. Tokenmaxxing is that same failure applied to engineering productivity.<br>The same gap applies to AI lab revenue: labs capture the token revenue, but they cannot separate durable demand from performative demand. The measurement gap might flatter their quarter, but it undermines their planning in the same motion. If even 15-20% of enterprise usage is performative and corrects, the...