Why Token Optimization Is a Gift to the Hyperscalers
UncoverAlpha
UncoverAlpha · Jun 29, 2026 · Paid
Hey everyone,

A few weeks ago I wrote a piece called Most of the Economy Won’t Run on the Best Model, where I argued that the AI market will bifurcate: the frontier model goes to the small slice of work where the economic value of intelligence is unbounded (drug discovery, novel math, the hardest agentic reasoning), and the middle of the economy (classification, extraction, summarization, routine code, support) runs on the cheapest model that clears the quality bar.

This article is the sequel, and it asks one question: who captures the value when the world switches from token maxing to token optimization?

The shift away from always-buy-the-best-model is, on its surface, bearish for the AI labs and looks like it should compress the whole stack. But it is quietly one of the most bullish structural setups for the three hyperscalers: Microsoft, Amazon, and Google.

Let me explain why.

Think about a toll bridge. There are two businesses operating on it. The first is the company that manufactures the cars. It makes the actual machine that does the work of getting you somewhere, and there’s a real margin in a car. The second is the company that owns the tollbooth. The tollbooth owner doesn’t care what car you drive. A Ferrari, a Toyota, or a 12-year-old used Honda: every one of them pays the same toll to cross the bridge. Now imagine a world where, suddenly, everyone realizes they were commuting to work in a Ferrari for no reason, and they all downgrade to the cheaper Honda to save money. The carmaker’s revenue per vehicle collapses, but people don’t drive less because they switched to a cheaper car. They drive more, because now it’s cheap enough to justify trips they’d never have taken before. And every one of those extra trips still crosses the bridge. The tollbooth’s revenue goes up.

In the AI stack, the AI labs are the carmakers. The hyperscalers are the tollbooths. 
And over the next few months, we are entering a period where most of the economy trades down from the Ferrari to the Honda while driving 10x more miles.

From token maxing to token optimization

For the last 18 months, the dominant behavior in enterprise AI was token maxing. You found the single best model on the leaderboard, you pointed every workload at it, and you didn’t think too hard about cost, because the whole thing was a pilot and the bill was small relative to the perceived upside of answering “does this even work?”

That era is ending fast. Companies are blowing through their AI budgets in a quarter. Altman said recently that the “my company spent my entire 2026 budget in Q1, can you make this more efficient?” complaint went from something that “never came up” to “all of a sudden a huge issue.” And it’s not just him; you can see it across the industry, from companies like Salesforce and Meta to many smaller ones, blowing through their planned yearly budgets in a matter of days.

The natural response is routing. Instead of using one model for everything, a small, cheap, often open-weight model handles the 80% of requests that are less complex, and only the genuinely hard requests escalate to the frontier. This is token optimization, and it doesn’t reduce total token consumption; it accelerates it. The moment inference gets cheap enough, you stop rationing it. You run the agent in a loop. You let it read the whole codebase. You re-run it five times and vote on the answer. A single coding-agent session now chews through millions of tokens of context where a chatbot query used a few thousand.

You can see this in the hard numbers the hyperscalers themselves disclose. Microsoft said it processed over 100 trillion tokens in a single quarter in 2025, up 5x year-over-year, with a record 50 trillion in one month alone. 
By its fiscal Q3 2026 call, Microsoft said over 300 customers were on track to process more than a trillion tokens each on Foundry this year, and that this was accelerating 30% quarter-over-quarter. Google went from 480 trillion tokens per month at I/O in May 2025, to 980 trillion by July, to 1.3 quadrillion by October. In its Q1 2026 filing, it disclosed that its first-party models alone were processing more than 16 billion tokens per minute via direct API, up 60% in a single quarter.

At the same time, the price per unit of capability is falling substantially: the Stanford HAI AI Index found that inference cost for GPT-3.5-level performance fell more than 280-fold in two years, and a16z pegs the decline at roughly 10x per year for any fixed capability level. And yet, total tokens processed are growing several-fold per year.

Where the margin lives in a routed token economy

When a company calls the SOTA model directly through, say, an AI lab’s first-party API, the lab captures the full economic rent of the token....
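
To make the routing arithmetic from the previous section concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than data from any provider: the tier names, the per-token prices, the complexity score, and the 0.8 threshold are all hypothetical. It only shows how a router that sends easy requests to a cheap model changes the bill, and what happens when volume then grows 10x.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_million_tokens: float  # hypothetical price, not a real quote


# Hypothetical tiers: the "Honda" and the "Ferrari" from the analogy.
CHEAP = ModelTier("small-open-weight", 0.20)
FRONTIER = ModelTier("frontier", 10.00)


def route(complexity: float, threshold: float = 0.8) -> ModelTier:
    """Escalate to the frontier tier only when the (assumed) complexity
    score reaches the threshold; everything else goes to the cheap tier."""
    return FRONTIER if complexity >= threshold else CHEAP


def total_cost(requests: list[tuple[float, int]], threshold: float = 0.8) -> float:
    """Total USD cost for a list of (complexity, token_count) requests."""
    return sum(
        route(complexity, threshold).usd_per_million_tokens * tokens / 1_000_000
        for complexity, tokens in requests
    )


if __name__ == "__main__":
    # Ten requests of one million tokens each: 80% easy, 20% hard.
    workload = [(0.1, 1_000_000)] * 8 + [(0.9, 1_000_000)] * 2

    # threshold=0.0 sends every request to the frontier tier (token maxing).
    print("all frontier:", round(total_cost(workload, threshold=0.0), 2))
    print("routed:", round(total_cost(workload), 2))

    # Same routing, but usage grows 10x once inference is cheap.
    grown = [(c, t * 10) for c, t in workload]
    print("routed, 10x volume:", round(total_cost(grown), 2))
```

With these made-up numbers, routing cuts the bill from $100 to about $21.60, while token volume is unchanged. Once usage grows 10x, the routed bill is about $216, and ten times as many tokens have been processed along the way. That is the dynamic the tollbooth analogy describes: revenue per token falls while traffic rises.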