The Golem in the Server Rack

The Golem in the Server Rack — Rhyd Media


The trillion-dollar bet on hyperscale GPU infrastructure is the wrong bet, and the alternative is one the same companies are already quietly building.

James B. · 1 June 2026 · 18 min read

Author's note: I build software, not chips, but you don't need to be a silicon fabricator to see when an industry is pouring concrete for the wrong building. The trillion-dollar bet currently being placed on hyperscale GPU infrastructure is the wrong bet, and what makes the claim defensible isn't a hot take — it's the alternative the same companies are already, quietly, building for themselves. More on that below.

The dominant cultural frame for AI is the Terminator/Frankenstein story: we built something that's going to wake up, develop a will, and turn on us. It's a compelling myth. It's also the wrong reference for understanding what's actually sitting in the server racks.

A better folk reference is the Golem of Prague. In the legend, the Golem is a creature shaped from clay and animated by a script inscribed on its forehead. It has no will, no malice, no inner life. It is pure, relentless execution. If you tell a Golem to clean the house, it tears the load-bearing walls down to get at the dust behind them. The danger isn't that the Golem rebels. The danger is that it is dangerously obedient to imperfect instructions.

That framing matters, because it shifts the conversation from "will the model become conscious" to "what is this thing actually doing, and what does it cost to do it." Both questions are worth asking. Only the second one is answerable today — and the answer ought to be making investors very nervous.

The Capital Expenditure Bet

Right now, somewhere north of a trillion dollars in capital expenditure is being committed across Microsoft, Meta, Google, Amazon, Oracle, OpenAI, and the rest over the 2024-2028 horizon — Microsoft has guided to roughly $80 billion in FY25 AI infrastructure spending, Meta finished 2025 at $72 billion and has since raised its 2026 guidance to $125-145 billion, and the OpenAI/Stargate consortium signalled $500 billion across four years when announced in January 2025. (The Stargate number is worth flagging as already-moving: the flagship Abilene expansion was cancelled in early 2026, several senior infrastructure leaders left for Meta, and OpenAI has since announced an eight-year, $100 billion AWS deal alongside contracts with Google Cloud and CoreWeave. The CapEx isn't as locked-in as the headline numbers imply, which is itself part of the story.) The bet rests on a specific assumption: that the path forward in AI is to buy as many high-end NVIDIA chips as possible, rack them into hyperscale datacentres, draw gigawatts off the grid, and rent access to centralised models.

The bet is going to age badly. Not because AI is overhyped — it isn't — and not because NVIDIA disappears or these datacentres go dark; they'll produce real compute and real revenue for years. The claim is narrower than that. The return-on-CapEx assumptions currently being modelled won't hold, because by the time the most ambitious of these datacentres finish commissioning, three things will likely already be true:

1. The chips inside them will have been outclassed on power and cost by purpose-built silicon, including, critically, silicon the hyperscalers themselves are already building.

2. The grid those datacentres depend on still won't have the headroom to run them at planned utilisation.

3. The workloads they were designed to serve will have been migrating, for years already, onto custom ASICs (Microsoft's Maia, AWS's Trainium, Google's TPU), with some of the next generation of that silicon running in low Earth orbit.

All three are happening already. Hold that for a moment; I'll come back to the details.

The Memory Wall

When the transformer architecture took off in 2017, the existing tool that happened to be good at large parallel matrix multiplication was the GPU. Researchers used what was on the shelf, NVIDIA was rewarded handsomely for it, and the path-dependency calcified from there.

In fairness to NVIDIA: modern datacentre chips like the H100 and B200 are not the gaming cards they evolved from. Most of the rendering pipeline has been stripped out and the silicon area is now dominated by tensor cores and HBM (high-bandwidth memory) tuned for AI workloads. The "graphics cards repurposed for AI" critique applied more to 2018 than to 2026.

But the architecture still inherits a real constraint. Large language model inference is autoregressive: the model produces one token, reads the whole context back, predicts the next token, repeats. For low-batch, latency-sensitive inference — the kind that dominates interactive chat workloads — that loop is memory-bandwidth bound, not compute bound. The expensive floating-point engines spend a meaningful percentage of every cycle waiting on the memory bus. The...
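The memory-bound claim can be sanity-checked with back-of-envelope arithmetic. What follows is a minimal sketch, not a benchmark. The `Accelerator` class, the function names, and the round bandwidth and throughput figures are hypothetical stand-ins rather than datasheet values for any real chip. It assumes every weight is read once per decode step, that each parameter costs roughly 2 FLOPs per sequence in the batch, and it ignores KV-cache traffic entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical accelerator; substitute real datasheet values."""
    mem_bandwidth_bytes_per_s: float  # sustained memory bandwidth
    peak_flops_per_s: float           # dense matrix-multiply throughput


def decode_step_times(params: float, bytes_per_param: float,
                      batch: int, chip: Accelerator) -> tuple[float, float]:
    """Return (memory_time, compute_time) in seconds for one decode step.

    Assumes all weights are streamed from memory once per step and that
    generating one token costs about 2 FLOPs per parameter per sequence.
    KV-cache reads are ignored, which flatters the memory side.
    """
    memory_time = params * bytes_per_param / chip.mem_bandwidth_bytes_per_s
    compute_time = 2.0 * params * batch / chip.peak_flops_per_s
    return memory_time, compute_time


def limiting_resource(params: float, bytes_per_param: float,
                      batch: int, chip: Accelerator) -> str:
    """Name whichever resource takes longer for one decode step."""
    memory_time, compute_time = decode_step_times(
        params, bytes_per_param, batch, chip)
    return "memory" if memory_time > compute_time else "compute"


# Illustrative round numbers only: 3 TB/s of bandwidth, 1 PFLOP/s of compute,
# and a 7-billion-parameter model stored at 2 bytes per parameter.
chip = Accelerator(mem_bandwidth_bytes_per_s=3.0e12, peak_flops_per_s=1.0e15)
PARAMS = 7e9
BYTES_PER_PARAM = 2.0

for batch in (1, 8, 64, 512):
    mem_t, comp_t = decode_step_times(PARAMS, BYTES_PER_PARAM, batch, chip)
    verdict = limiting_resource(PARAMS, BYTES_PER_PARAM, batch, chip)
    print(f"batch={batch:4d}  memory={mem_t * 1e3:7.3f} ms  "
          f"compute={comp_t * 1e3:7.3f} ms  bound={verdict}")
```

Under these assumptions the time spent streaming weights is fixed per step, while compute time grows with batch size. At small batches the arithmetic units sit idle for most of each step, and only at a batch of several hundred does compute become the limit. That is the gap low-batch interactive serving struggles to close.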
