How we fight GPU scarcity without compromise
Alessio Dalla Piazza, Simone Businaro, Paolo Maccacaro, Giorgio Roffo
Table of contents
Why round robin doesn’t work for LLM inference
What about session affinity?
The KV cache: The hidden variable in LLM serving
Inspiration: vLLM Semantic Router and llm-d
Our solution: a cache-aware routing proxy with auto-scaling
Lessons learned
Being smart is the only option
GPU scarcity is not a temporary supply-chain hiccup. It’s a structural problem. Hyperscalers are stockpiling capacity, and the primary drivers are simple: demand for AI compute far outpaces what fabs can produce, and the biggest players lock up supply years in advance. On top of that, energy costs keep climbing, pushed higher by geopolitical instability: tensions between the US, Israel, and Iran, port closures disrupting supply chains, and the broader energy market uncertainty that follows. It all adds up to per-hour GPU pricing that keeps going up.
In practice, H100s are already hard to get on short notice, and large B200 clusters in Europe are harder still: expensive, scarce, and often tied to capacity commitments negotiated well in advance.
The dynamics feel familiar if you’ve watched hardware markets shift before. Look at DDR5 memory. In 2025, DDR5 prices rose sharply, with spot markets seeing extreme spikes. The transition from DDR4 didn’t just bring faster speeds. It brought a sustained price premium driven by retooling costs, new fabrication processes, and demand that outpaced supply. GPU pricing follows the same pattern, but amplified. When OpenAI, Google, Meta, and Microsoft are competing for the same wafers, the rest of us feel the squeeze.
Not time to short NVIDIA yet 😉 (not financial advice).
But if you’re building AI-powered products, as we do at Equixly with our AI-driven Penetration Testing Agent, you can’t just wait for prices to normalize. You have to be smart about how you use the GPUs you can get. And that starts with understanding what makes LLM inference fundamentally different from traditional workloads. The core idea is simple: LLM inference shouldn’t be routed like stateless web traffic, because the KV cache makes each inference node stateful.
Why round robin doesn’t work for LLM inference
If you’re not familiar with round robin, it’s the simplest load balancing strategy out there. You have multiple backend servers, and you send each new request to the next in line, cycling through them one by one. Request 1 goes to Server A, request 2 to Server B, request 3 to Server C, then back to A, and so on. It distributes requests evenly and ignores what each server is currently doing. Simple, effective, and it has worked great for stateless web services for decades.
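As a minimal sketch of the strategy just described (the function name `make_round_robin` and the server labels are illustrative, not taken from any particular load balancer):

```python
from itertools import cycle

def make_round_robin(backends):
    """Return a picker that hands out backends in a fixed rotation."""
    ring = cycle(backends)

    def pick(_request=None):
        # The request is deliberately ignored: round robin never looks at it.
        return next(ring)

    return pick

pick = make_round_robin(["A", "B", "C"])
assignments = [pick(f"request-{i}") for i in range(1, 5)]
print(assignments)  # ['A', 'B', 'C', 'A']
```

Note that `pick` never inspects its argument. That is fine for stateless services and is exactly the property that becomes a problem once backends hold state.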
But LLM inference is not stateless.
When a large language model processes a prompt, it builds an internal data structure called the KV cache (key-value cache). We’ll go deeper on this in a moment, but the short version is: it stores intermediate computations so the model doesn’t have to redo them. This cache is local to the GPU that ran the computation. If the next request in the same conversation lands on a different node, that cache is not available there. The new node has to recompute everything from scratch by running prefill again, the expensive phase of inference that processes the whole prompt.
Round robin is completely blind to this. It treats every request as independent and routes them without knowing which node holds useful cached state. The result: redundant computation, wasted GPU cycles, higher latency, and lower throughput.
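To make the cost concrete, here is a toy model, not a benchmark. It assumes a single conversation whose prompt grows every turn (each prompt extends the previous one), that a node keeps the KV entries for the last prompt it served, and that nothing is ever evicted. The token counts and the helper `prefill_tokens` are made up for illustration.

```python
from itertools import cycle

def prefill_tokens(prompt_lengths, node_for_turn):
    """Total tokens that must be prefilled across all turns.

    cached[node] is how many leading prompt tokens that node already
    holds KV entries for (toy assumption: no eviction, one conversation).
    """
    cached = {}
    total = 0
    for length, node in zip(prompt_lengths, node_for_turn):
        total += length - cached.get(node, 0)  # only the uncached suffix
        cached[node] = length
    return total

# Six turns; the prompt grows by 200 tokens per turn (illustrative numbers).
prompt_lengths = [1000, 1200, 1400, 1600, 1800, 2000]

ring = cycle(["A", "B", "C"])
round_robin = [next(ring) for _ in prompt_lengths]
same_node = ["A"] * len(prompt_lengths)

rr_cost = prefill_tokens(prompt_lengths, round_robin)
pinned_cost = prefill_tokens(prompt_lengths, same_node)
print(rr_cost, pinned_cost)  # 5400 2000
```

Under these assumptions, rotating across three nodes prefills 5,400 tokens where keeping the conversation on one node prefills 2,000. Real numbers depend on the model, the cache size, and the eviction policy.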
What about session affinity?
The natural follow-up is: Why not use sticky sessions?
Session affinity (also called sticky sessions) means you pin a user to a specific backend server. Once a user connects to Server B, all subsequent requests go to Server B as well. This is common in web apps that store session state server-side. It sounds like it should solve the caching problem, right?
It’s better than round robin, but it’s still too coarse. It makes assumptions at the wrong level:
It pins users, not prefixes. Two different users might share the same system prompt or document context. Session affinity can’t recognize that and route them to the node that already has that prefix cached.
It creates hot spots. Power users or long conversations pile up on a single node, while other nodes sit idle. You end up with unbalanced GPU utilization across the fleet.
It doesn’t handle node failures well. If a node goes down, every session pinned to it loses its KV cache and must cold-start elsewhere.
It ignores multi-turn cache dynamics. In a multi-step agentic workflow, the relevant cache might span multiple prefixes that don’t map cleanly to a single session ID.
In short, each strategy sees one level deeper than the last. Round robin sees nothing: Every request is...