AI's Plummeting Prices Are a Software Story, Not a Hardware One
This has made local, open-weight models a real competitor to the frontier
James Wang
May 19, 2026
Why is model inference getting cheaper? How did I drop a soon-to-be $2,000+/month bill for AI agents to next to nothing? And why are local models on commodity hardware potentially “good enough” for most people?

There are two macro trends here that feed directly into each other.

First, AI inference costs, as I’ve mentioned before, have been dropping 70-90% per year. Guido Appenzeller coined the term “LLMflation” through his original observation that costs have “dropped by a factor of 1,000 in three years.” No matter how many times I say it (and however many smart observers point it out), it still shocks most people because AI feels like it’s getting more expensive.

That’s because costs are dropping for the same capacity (same model, same query), and we’re constantly ramping up what we use (bigger model, more expensive query). It’s the same reason why, despite Moore’s Law (which is slower than LLMflation), computers don’t cost $0.00001: we made computers bigger as we went, even as their cost exponentially plummeted.

This, however, is an old story. The interesting part of the piece is what drives this plummeting cost. It’s not (mainly) hardware. It’s software.
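To make the quoted figures concrete, here is a quick back-of-envelope sketch. The 1,000x-in-three-years number comes from the quote above; the Moore's Law rate (a doubling every two years) is my assumption of the usual shorthand, not something stated in this piece.

```python
# Back-of-envelope check on the "LLMflation" figures quoted above.
# The 1,000x / 3 years input is from the article.
# The Moore's Law doubling period is an ASSUMPTION (2x every 2 years).

def annual_factor(total_factor: float, years: float) -> float:
    """Per-year cost reduction factor implied by a total reduction over `years`."""
    return total_factor ** (1.0 / years)

def annual_drop_pct(factor_per_year: float) -> float:
    """Percentage price drop per year for a given per-year reduction factor."""
    return (1.0 - 1.0 / factor_per_year) * 100.0

# "Dropped by a factor of 1,000 in three years" works out to about 10x per year.
llm_factor = annual_factor(1000.0, 3.0)
llm_drop = annual_drop_pct(llm_factor)

# Assumed Moore's Law pace: 2x improvement every 2 years.
moore_factor = annual_factor(2.0, 2.0)
moore_drop = annual_drop_pct(moore_factor)

print(f"LLMflation: {llm_factor:.1f}x per year, about {llm_drop:.0f}% cheaper each year")
print(f"Moore's Law (assumed): {moore_factor:.2f}x per year, about {moore_drop:.0f}% cheaper each year")
```

Under those assumptions, a 1,000x drop over three years is the top of the 70-90% range (10x, or 90%, per year), while the assumed Moore's Law pace is closer to 30% per year.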
A chart of the 10x drop per year from a16z.

The second trend is a consequence of that: local, open-weight models on kind-of-old commodity hardware are becoming more and more competitive with models on the frontier. That obviously has big implications (and potential consequences) for what the frontier labs will ultimately be able to charge for the biggest models.

Local AI + Old GPU Beats Frontier Model
I ended up researching and writing this piece accidentally.

I’ve always experimented with open-weight models myself, even when it made no sense to run them. My (public) history playing around with language models goes back to 2014, when I published an implementation of a Google model paper for a data science certification. It’s always largely been a hobby/learning experience with no practical output, until recently.

A few weeks ago, I started playing with Qwen 3.6 27B, released about a month ago in April 2026. I didn’t run it on an Nvidia H100/B200 or some other data center GPU. I ran it on a consumer gaming card, an Nvidia RTX 3090 Ti, released in 2022.

While I had heard good reviews of its capabilities, I didn’t expect much of it. To my surprise, it felt like running Claude Sonnet, Anthropic’s model one tier below the biggest model (the one they encourage people to run as their “everyday” model).

Sure, it isn’t Opus (Anthropic’s biggest, priciest model) or GPT-5.5 Pro (OpenAI’s biggest model), but this thing is open-weight/free and runs on a 4-year-old consumer-grade GPU!

Of course, “feelings” don’t mean much for model quality. Fortunately, Anthropic forced my hand on finding alternatives to how I run my rather extensive agent workflow. I detailed that workflow here (it’s still one of my most popular pieces ever). As a note, that was a (relative) eternity ago. I have way more agents running at this point, enough that I actually need agents that manage my agents (I reinvented the org chart…).
How many agents do I have? Who can say anyway? I’m sure it’s like having too many children. At some point you just start rounding.

The Coming Anthropic Agent Crackdown
Why did Anthropic force my hand? Well, starting June 15, 2026, claude -p (which, as per my agents article, is how most of my autonomous AI agents run) is no longer included in the subscription. Instead, you get $200 worth of credits, which at full API rates is 25x more expensive. So, how much does my agent stack cost? Am I somewhere close to $200?

Uh, no. My automated systems alone would run north of $2,000-$3,000/month at those rates, and the all-in personal number is, well, higher. I do not begrudge Anthropic cracking down on this. I may have been following guidelines from Boris Cherny, head of Claude Code, on acceptable use... but still, that’s a lot of money multiplied across a ton of users.

So an experiment that would otherwise have been “interesting but academic” suddenly had a budget attached to it. Dropping down from Opus to Sonnet for almost everything only got me down to around $1,000/month (and, for some of it, I didn’t get acceptable results from Sonnet).

For part of it, I could use codex exec (OpenAI’s equivalent of claude -p) on my $20 ChatGPT plan. Replace one frontier model with another. But $20 isn’t enough, especially since I need GPT-5.5 for the harder tasks Sonnet fails at.

The question is, can Qwen 3.6 27B (meaning 27B parameters) really step in for Sonnet (a model that likely has hundreds of billions of parameters)? Well, the public benchmarks seem to suggest it can.
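As a side note on why a 27B-parameter model is even a candidate for a consumer card, here is a rough weights-only memory estimate. The parameter count is from the text above and 24 GB is the RTX 3090 Ti's memory capacity. The bit widths are illustrative assumptions on my part: this piece does not say which precision or quantization was actually used, and real usage needs extra memory for the KV cache and activations on top of the weights.

```python
# Rough weights-only memory estimate for a 27B-parameter model.
# ASSUMPTION: the bit widths below are illustrative; the article does not
# state which precision or quantization format was actually used.

PARAMS = 27e9      # 27B parameters (from the article)
VRAM_GB = 24.0     # RTX 3090 Ti memory capacity

def weights_gb(params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

# Weights-only footprint at 16-bit, 8-bit, and 4-bit precision.
estimates = {bits: weights_gb(PARAMS, bits) for bits in (16, 8, 4)}

for bits, gb in estimates.items():
    verdict = "fits" if gb < VRAM_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB -> {verdict} in {VRAM_GB:.0f} GB (weights only)")
```

On these assumptions, only the 4-bit case leaves headroom on a 24 GB card, which suggests the software side (how the weights are stored and served) matters as much as the GPU itself.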
Qwen 3.6 27B holding its own against frontier offerings on...