The current AI pricing was always going to go away - Arnon Shimoni
Skip to content
The current AI pricing was always going to go away
The current AI pricing was always going to go away. It just doesn’t make sense.
Microsoft canceled internal Claude Code licenses this week (for whatever reason, even if it’s because they integrated it), Uber blew its entire 2026 AI budget in four months, and GitHub is dropping flat-rate plans across its products.
You’ll see the framing "the AI subsidy era is ending" which is a polite way of what everyone’s been doing when they slap AI features into every tier of their product on a bet that inference costs would keep falling.
They didn’t and the cost curve is bending the wrong way, and the labs have no choice except to pass that along.
Did we collectively forget second order thinking?
Each model generation, costs per token did fall in theory, sometimes 10x less but that was for comparable quality… Lots of people extrapolated and built business models on the extrapolation, which… isn’t how you think about it.
Second-order thinking anyone?
Everyone who deals with road planning knows about is induced demand. Each new capability invents new demand. Highways are the textbook case. Add a lane, you get new commutes. The commutes weren’t there before the lane. AI is the same shape. Cheaper inference doesn’t reduce the bill, it expands what people ask the model to do.
Now my reasoning queries take >4 minutes, where the old ones took 2m… Agentic workflows make 50 calls where the old workflow made one. Unit cost falls, units explode, but still the total spend goes up.
Anyone selling a flat-rate "AI assistant" assumed user behavior wouldn’t change. It did. It always does.
The second is that the supply side stopped cooperating – memory and GPU economics are moving against you.
Memory got 4x more expensive. GPUs got >95% more expensive.
Frontier training and inference run on Nvidia accelerators paired with high-bandwidth memory. The ceiling isn’t transistors anymore, it’s HBM and the advanced packaging that bonds it to the compute die.
Morgan Stanley estimates the bill of material (BOM) on the new NVIDIA VR200s will be 95% higher – memory accounting for 435% growth ALONE.
That ceiling is one factory deep. TSMC’s CoWoS packaging line is the bottleneck for accelerator supply. SK Hynix dominates HBM, with Samsung lagging and Micron behind that. None of them can add capacity overnight. These are 18-to-36 month commitments, minimum, and they were planned for a world that under-forecast demand by an order of magnitude.
So GPU pricing is what scarcity pricing looks like. Top-end accelerators today are roughly 2x more expensive than the previous generation at comparable cluster scale. HBM prices have 4x’d in 18 months. Power and cooling are now real constraints in places nobody used to model power for, which is why every hyperscaler now has a "we’re building a gigawatt campus" story and a nuclear-PPA press release.
Anthropic’s CFO testified under oath this March that the company spent $10 billion on compute and made $5 billion in revenue (Ed Zitron has the math). The labs are underwater on inference. They’re raising prices to keep the lights on.
Companies that sold flat-rate AI-everywhere products are now sitting on a margin problem they architected themselves into. The bet was that one of these curves would bend in their favor. None of them did, probably none of them will, certainly not on the timeline their pricing assumed.
What changes from here
The product question shifts. It stops being "where can we add AI?" and starts being "which use cases earn the inference cost they burn?" That’s a harder roadmap to write. It also changes the pricing surface, which is the part most product teams haven’t internalized.
Three architectures handle a moving cost. None of them are new. All of them are uncomfortable for sales teams that grew up selling seats.
Per-action. Every API call, every generation, every agent step has a price. Revenue scales with cost because they’re indexed to the same underlying event. Twilio has run this since 2008. AWS has run a version of it since 2006. The downside is transparency cuts both ways. Customers see the meter, and they negotiate. The upside is your gross margin doesn’t depend on guessing how hard your power users will hammer the system.
Credits. Prepaid buckets. Customer buys 100,000 credits, burns them down on whatever, refills. Credits smooth cash flow and let you mix model costs behind a single unit, which is the only sane way to handle a product that routes between five different inference providers. The trap is breakage. Snowflake credits are infrastructure, customers understand what they’re buying. Gift-card credits are stranded assets, and customers can tell which one they bought. You only get to do the second one once.
Hybrid. Base seat with included credits and metered overage. Most enterprise sales motions accept this without...