Running 26B and 35B LLMs at Full Speed on €990 of Used Hardware – No Cloud




AI Advances

Democratizing access to artificial intelligence

Running 26B and 35B LLMs at Full Speed on €990 of Used Hardware — No Cloud Required

How a secondhand gaming PC keeps pace with a single RTX 3090 for local AI inference — and what I learned measuring it honestly, including the speculative-decoding “speedup” that turned out to be a measurement mirage.

Gian Luca Bailo

11 min read · Jun 10, 2026


[Image: The €990 machine: an RTX 4070 and an RTX 2070 SUPER sharing a case nobody would mistake for a workstation]

On June 3, 2026, Google added a 12B model to the Gemma 4 family it launched back in April. The pitch: frontier-ish performance that “runs on laptops” with 16 GB of VRAM, benchmarks nearly matching the larger 26B sibling. The tech press called it a milestone for local AI.

We were already running the whole family. The 12B at 50 tokens per second. The 26B at 83. And a 35B mixture-of-experts model from Alibaba at 73. All on the same machine, assembled from secondhand parts for €990.

That’s the story worth telling. Not which model dropped this week, but the fact that the floor for serious local inference has quietly fallen through. You no longer need a workstation. You need two used gaming cards, and the patience to measure what they’re actually doing, because one of the “obvious” speed tricks makes things slower on this kind of setup, and you only catch it if you test honestly.

The Machine

Nothing here is exotic, and nothing is new. Every part is sitting on a secondhand marketplace right now, today.

[Image: Every part bought used. Two gaming GPUs, 20 GB of combined VRAM, ~€990 all in]

The trick is the pairing. A 2070 SUPER (8 GB) next to a 4070 (12 GB) gives you 20 GB of combined VRAM: not from one expensive card, but from two that nobody wanted anymore, bought at the bottom of their depreciation curve. The 4070 brings modern Ada throughput; the 2070 SUPER brings cheap capacity. Together they punch well above what either does alone.

The OS is antiX Linux: SysV init, no systemd. Lean, predictable, and it stays out of the way. On a machine where every gigabyte of RAM counts, that minimalism is a feature, not nostalgia.

llama.cpp, Not Ollama

Ollama is the comfortable choice, and for good reason. It fetches models, serves a clean API, and works the moment you install it. For a first experiment it’s excellent.

But comfort has a cost: it hides the exact controls you need once you start pushing the hardware. llama.cpp exposes every one of them: per-GPU layer distribution (--tensor-split), flash attention, expert offloading for MoE models, and speculative decoding via Multi-Token Prediction (merged to mainline in May 2026, PR #22673).

Here’s the picture across all three models, along with one note on method that matters more than it sounds. Every number here is measured on a sustained 2000-token generation, not a short burst. That distinction turned out to be the whole ballgame, as you’ll see in the MTP section.
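The sustained-measurement idea can be made concrete with a small timing harness. This is a minimal sketch, not the benchmark script behind the numbers in this article: `generate` is a hypothetical callable standing in for whatever backend you drive (a llama.cpp server, Ollama), assumed to block until generation finishes and return the number of tokens it actually produced. The `clock` parameter and the fake backend are also illustrative, there only so the harness can be exercised without a GPU.

```python
import time
from typing import Callable, Tuple


def sustained_tok_per_s(
    generate: Callable[[int], int],
    n_tokens: int = 2000,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one long generation and return tokens per second.

    `generate(n)` is a hypothetical stand-in for a backend call: it should
    block until generation finishes and return the tokens actually produced.
    """
    start = clock()
    produced = generate(n_tokens)
    elapsed = clock() - start
    if produced <= 0 or elapsed <= 0:
        raise ValueError("generation produced no tokens or took no time")
    # Divide by what was produced, not what was requested: a backend may
    # stop early, and counting requested tokens would inflate the rate.
    return produced / elapsed


def make_fake_backend(
    seconds_per_token: float,
) -> Tuple[Callable[[int], int], Callable[[], float]]:
    """Build a fake backend plus a matching fake clock (illustration only)."""
    now = [0.0]

    def clock() -> float:
        return now[0]

    def generate(n: int) -> int:
        now[0] += n * seconds_per_token  # pretend time passes as tokens stream
        return n

    return generate, clock


if __name__ == "__main__":
    generate, clock = make_fake_backend(seconds_per_token=0.0125)
    print(f"{sustained_tok_per_s(generate, 2000, clock):.1f} tok/s")
```

To measure a real backend, replace the fake `generate` with a function that requests 2000 tokens from your server and returns the completion-token count it reports, and leave `clock` at its default.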

[Image: Sustained throughput across backends. llama.cpp opens a 35–38% gap on the MoE models; for Qwen, the IQ4_XS quantization adds another jump to 73 tok/s]

[Image: The same numbers in full. Ollama runs its default Q4_K_M; the llama.cpp gains are measured at the same quantization, with Qwen’s best-config IQ4_XS shown separately]

llama.cpp beats Ollama by 35–38% on the MoE models, measured at the same quantization (both running Q4_K_M, the default Ollama pulls). For Qwen there’s a second, separate gain on top: switching to the IQ4_XS quantization (same quality, fewer operations per token on this architecture) takes it from 61.9 to 73 tok/s. Two independent levers, kept apart on purpose: backend first, quantization second. (More on why I’m so careful to isolate them in the MTP section; that’s where mixing variables nearly cost me a wrong conclusion.)

One honest caveat: Ollama v0.30.5 is no longer naive about MoE models. It runs Qwen3.6 35B-A3B (21 GB, a full gigabyte over total VRAM) at a respectable 46 tok/s, clearly doing some automatic expert scheduling under the hood. The gap with llama.cpp is real, but it isn’t the chasm older comparisons suggested. If you never leave Ollama, you’re not stranded. You’re just leaving a third of your throughput on the table.

And that gap is the whole point. Where llama.cpp earns its complexity is the control surface: -ncmoe lets you pick exactly how many MoE expert layers spill to CPU RAM (more on this below); --tensor-split sets per-GPU layer ratios at single-percent granularity; --no-mmap stops cache thrashing on the big models.

Model 1: Gemma 4 26B – Fits Cleanly,...
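As a rough illustration of the control surface described above, the three flags could be assembled into a launch command like this. It is a sketch under assumptions, not the configuration used for the measurements: the `llama-server` binary name, the model path, the `-ngl 99` value, the 12:8 split (proportional to the VRAM of a 12 GB and an 8 GB card, listed in device order) and the `-ncmoe` count are all placeholder values chosen for illustration. Check `llama-server --help` on your own build before relying on any of them.

```python
import shlex
from typing import List, Sequence


def build_server_argv(
    model_path: str,
    vram_gb: Sequence[int],
    cpu_moe_layers: int = 0,
    use_mmap: bool = False,
) -> List[str]:
    """Assemble a llama-server command line from the flags named in the text.

    All values are illustrative placeholders, not tuned settings.
    """
    argv = [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",  # assumption: ask for every layer to be offloaded
        # Split layers across GPUs in proportion to each card's VRAM.
        "--tensor-split", ",".join(str(gb) for gb in vram_gb),
    ]
    if cpu_moe_layers > 0:
        # Keep the expert weights of this many layers in CPU RAM.
        argv += ["-ncmoe", str(cpu_moe_layers)]
    if not use_mmap:
        argv.append("--no-mmap")
    return argv


if __name__ == "__main__":
    argv = build_server_argv(
        "models/example-moe.gguf",  # hypothetical path
        vram_gb=[12, 8],
        cpu_moe_layers=4,  # hypothetical count; tune until the model fits
    )
    print(shlex.join(argv))
```

Building the argument list in code rather than by hand makes it easy to change one setting at a time between benchmark runs, which is the same isolate-one-variable discipline applied to backend and quantization above.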
