Gemma 4 26B on consumer-grade GPU | Algol Labs
For the past week I've been running Google's Gemma 4 26B as my daily local agent on a workstation. No API calls, no cloud, no rate limits. Just a single RTX 5070 Ti, llama.cpp compiled from source, and a systemd unit.<br>This post covers what it took to get there, the throughput and accuracy numbers I measured, and an honest read on what this hardware tier is now capable of for tool-using agentic work.
TL;DR<br>Model: Gemma 4 26B A4B Instruct (MoE, 26B total / 3.8B active)<br>Quant: unsloth/gemma-4-26B-A4B-it-GGUF (UD-IQ4_XS, 12.65 GiB)<br>Hardware: RTX 5070 Ti (16 GB), Ryzen 9 9950X3D, 64 GB DDR5, Fedora 43<br>Throughput: 5,951 t/s prompt processing, 137.7 t/s token generation (pp2048 / tg64, llama-bench)<br>BFCL accuracy: 89.13% non-live, 63.80% live, 45.12% multi-turn (first published BFCL numbers for Gemma 4)<br>Daily use: a week as the model behind opencode, no OOMs, 65k context, sub-second response on agentic loops
Why this matters<br>The narrative that serious agentic work requires either H100s or API access is increasingly dated. Gemma 4's mixture-of-experts architecture activates just 3.8B of its 26B parameters per forward pass, which means a quantized version fits comfortably in 16 GB of VRAM and runs at speeds that match or beat what most managed services give you on shared infrastructure.<br>For someone running a small consultancy, a research project, or a privacy-sensitive workload, the math has changed. A workstation that costs less than a single year of frontier-API credits can now serve a model that benchmarks competitively against hosted offerings on structured tasks.<br>I wanted to test that claim with real numbers rather than vibes.
The hardware<br>Component | Spec<br>CPU | AMD Ryzen 9 9950X3D, 16C/32T, 5.7 GHz boost, 128 MB L3<br>RAM | 64 GB DDR5<br>GPU | NVIDIA RTX 5070 Ti, 16 GB GDDR7, GB203 (Blackwell, sm_120)<br>OS | Fedora 43, kernel 6.18
Unfortunately, I bought this GPU at the peak of the recent price spike for €1,150 (including 21% VAT), so your mileage may vary. The rest of the box is consumer hardware that any serious developer might already own.
The build saga<br>Getting llama.cpp to build cleanly on this hardware required a three-layer compatibility fix. None of the layers are llama.cpp's fault — they're the joint cost of running a brand-new GPU architecture (Blackwell) on a bleeding-edge distro (Fedora 43, GCC 15, glibc 2.41). Each layer is small in isolation; the combination wasn't documented anywhere I could find.<br>Layer 1: CUDA toolkit version pin<br>CUDA 13.x has a known segfault in the MMQ (Matrix Multiply Quantized) kernel on Blackwell. Without MMQ you fall back to cuBLAS and lose 5–6× on prompt processing. Solution: install CUDA 12.8 toolkit alongside the 580.x driver. The driver is forward-compatible with the older toolkit, and dnf install cuda-toolkit-12-8 cleanly drops the toolkit at /usr/local/cuda-12.8/ without touching kernel modules.<br>Layer 2: GCC host compiler downgrade<br>CUDA 12.8's cudafe++ cannot parse GCC 15's headers — it chokes on __is_pointer and __is_volatile builtins. The --allow-unsupported-compiler flag isn't enough; cudafe++ rejects the headers before the flag matters. Solution: install Fedora's compatibility package gcc14-c++ and point CMAKE_CUDA_HOST_COMPILER at g++-14. Only nvcc's host pass uses GCC 14; the rest of the C++ compilation stays on GCC 15.<br>Layer 3: glibc 2.41 noexcept conflict<br>Fedora 43 ships glibc 2.41, which adds C23-conformant declarations of cospi, sinpi, and rsqrt (plus their float variants) marked noexcept(true). CUDA 12.8's math_functions.h declares the same names without noexcept. cudafe++ rejects the mismatch. There's no upstream CUDA fix yet. Workaround: sed six declarations in crt/math_functions.h to add noexcept. The patch is six lines and covered by a single backup file.<br>The full automation, including idempotent re-runs, lives in the Build automation appendix below — ./setup-llama-cpp.sh all runs the whole pipeline.
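To make layers 2 and 3 concrete, here is a rough shell sketch. It is not the contents of setup-llama-cpp.sh. The function names `patch_noexcept` and `configure_llama` are hypothetical. The sed pattern assumes the clashing declarations have the shape `rsqrt(double x);`. The header location under `$CUDA_HOME/include/crt/` and the `-DCMAKE_CUDA_ARCHITECTURES=120` setting are assumptions inferred from the toolkit path and the sm_120 target mentioned above.

```shell
# Illustrative sketch only; function names and paths are assumptions.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-12.8}"

# Layer 3: append "noexcept (true)" to the declarations of cospi, sinpi,
# rsqrt and their float variants so they agree with glibc 2.41.
# Safe to re-run: already-patched lines no longer match the pattern,
# and the backup is written only once.
patch_noexcept() {
  header="$1"
  [ -f "${header}.orig" ] || cp -- "$header" "${header}.orig"
  sed -E -i \
    's/([[:space:]])(cospif?|sinpif?|rsqrtf?)\((double|float) x\);/\1\2(\3 x) noexcept (true);/' \
    "$header"
}

# Layers 1 and 2: configure llama.cpp against the CUDA 12.8 nvcc, with
# GCC 14 used only as nvcc's host compiler.
configure_llama() {
  src="$1"
  cmake -S "$src" -B "$src/build" \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_COMPILER="$CUDA_HOME/bin/nvcc" \
    -DCMAKE_CUDA_HOST_COMPILER=g++-14 \
    -DCMAKE_CUDA_ARCHITECTURES=120 \
    -DCMAKE_BUILD_TYPE=Release
}

# Example (patching the toolkit header normally needs root):
#   patch_noexcept "$CUDA_HOME/include/crt/math_functions.h"
#   configure_llama "$HOME/src/llama.cpp" && cmake --build "$HOME/src/llama.cpp/build" -j
```

Because the backup is only written when no `.orig` file exists, restoring the pristine header is a single `cp` in the other direction. It is worth checking the number of patched lines (six are expected, per the description above) before building.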
The systemd unit<br>llama-server runs as a user-level systemd unit on port 8080 with an OpenAI-compatible API. The relevant flags:<br>--n-gpu-layers 99 # full GPU offload (31/31 layers)<br>--ctx-size 65536 # 65k context window<br>--flash-attn on # required for mixed KV<br>--cache-type-k q8_0<br>--cache-type-v q4_0 # asymmetric KV: bigger gains where it...