Gemma 4 26B on a consumer GPU: build pain, throughput, and BFCL numbers

Gemma 4 26B on consumer-grade GPU | Algol LabsToggle menu

For the past week I've been running Google's Gemma 4 26B as my daily local agent on a workstation. No API calls, no cloud, no rate limits. Just a single RTX 5070 Ti, llama.cpp compiled from source, and a systemd unit. This post covers what it took to get there, the throughput and accuracy numbers I measured, and an honest read on what this hardware tier is now capable of for tool-using agentic work.

TL;DR li]:mt-3 text-muted-foreground" data-v-733a8408>Model: Gemma 4 26B A4B Instruct — MoE, 26B total / 3.8B active Quant: unsloth/gemma-4-26B-A4B-it-GGUF (UD-IQ4_XS, 12.65 GiB) Hardware: RTX 5070 Ti (16 GB), Ryzen 9 9950X3D, 64 GB DDR5 — Fedora 43 Throughput: 5,951 t/s prompt processing, 137.7 t/s token generation (pp2048 / tg64, llama-bench) BFCL accuracy: 89.13% non-live, 63.80% live, 45.12% multi-turn — first published BFCL numbers for Gemma 4 Daily use: a week as the model behind opencode, no OOMs, 65k context, sub-second response on agentic loops

Why this matters The narrative that serious agentic work requires either H100s or API access is increasingly dated. Gemma 4's mixture-of-experts architecture activates just 3.8B of its 26B parameters per forward pass, which means a quantized version fits comfortably in 16 GB of VRAM and runs at speeds that match or beat what most managed services give you on shared infrastructure. For someone running a small consultancy, a research project, or a privacy-sensitive workload, the math has changed. A workstation that costs less than a single year of frontier-API credits can now serve a model that benchmarks competitively against hosted offerings on structured tasks. I wanted to test that claim with real numbers rather than vibes.

The hardware [role=checkbox]]:translate-y-0.5 font-semibold" data-v-733a8408>Component[role=checkbox]]:translate-y-0.5 font-semibold" data-v-733a8408>Spec[role=checkbox]]:translate-y-0.5 font-medium" data-v-733a8408>CPU[role=checkbox]]:translate-y-0.5 text-muted-foreground" data-v-733a8408>AMD Ryzen 9 9950X3D — 16C/32T, 5.7 GHz boost, 128 MB L3[role=checkbox]]:translate-y-0.5 font-medium" data-v-733a8408>RAM[role=checkbox]]:translate-y-0.5 text-muted-foreground" data-v-733a8408>64 GB DDR5[role=checkbox]]:translate-y-0.5 font-medium" data-v-733a8408>GPU[role=checkbox]]:translate-y-0.5 text-muted-foreground" data-v-733a8408>NVIDIA RTX 5070 Ti — 16 GB GDDR7, GB203 (Blackwell, sm_120)[role=checkbox]]:translate-y-0.5 font-medium" data-v-733a8408>OS[role=checkbox]]:translate-y-0.5 text-muted-foreground" data-v-733a8408>Fedora 43, kernel 6.18

Unfortunately I bought this GPU at the top of the recent price spikes for €1150 (21% VAT included) so YMMV. The rest of the box is consumer hardware that any serious developer might already own.

The build saga Getting llama.cpp to build cleanly on this hardware required a three-layer compatibility fix. None of the layers are llama.cpp's fault — they're the joint cost of running a brand-new GPU architecture (Blackwell) on a bleeding-edge distro (Fedora 43, GCC 15, glibc 2.41). Each layer is small in isolation; the combination wasn't documented anywhere I could find. Layer 1: CUDA toolkit version pin CUDA 13.x has a known segfault in the MMQ (Matrix Multiply Quantized) kernel on Blackwell. Without MMQ you fall back to cuBLAS and lose 5–6× on prompt processing. Solution: install CUDA 12.8 toolkit alongside the 580.x driver. The driver is forward-compatible with the older toolkit, and dnf install cuda-toolkit-12-8 cleanly drops the toolkit at /usr/local/cuda-12.8/ without touching kernel modules. Layer 2: GCC host compiler downgrade CUDA 12.8's cudafe++ cannot parse GCC 15's headers — it chokes on __is_pointer and __is_volatile builtins. The --allow-unsupported-compiler flag isn't enough; cudafe++ rejects the headers before the flag matters. Solution: install Fedora's compatibility package gcc14-c++ and point CMAKE_CUDA_HOST_COMPILER at g++-14. Only nvcc's host pass uses GCC 14; the rest of the C++ compilation stays on GCC 15. Layer 3: glibc 2.41 noexcept conflict Fedora 43 ships glibc 2.41, which adds C23-conformant declarations of cospi, sinpi, and rsqrt (plus their float variants) marked noexcept(true). CUDA 12.8's math_functions.h declares the same names without noexcept. cudafe++ rejects the mismatch. There's no upstream CUDA fix yet. Workaround: sed six declarations in crt/math_functions.h to add noexcept. The patch is six lines and covered by a single backup file. The full automation, including idempotent re-runs, lives in the Build automation appendix below — ./setup-llama-cpp.sh all runs the whole pipeline.

The systemd unit llama-server runs as a user-level systemd unit on port 8080 with an OpenAI-compatible API. The relevant flags: --n-gpu-layers 99 full GPU offload (31/31 layers) --ctx-size 65536 65k context window --flash-attn on required for mixed KV --cache-type-k q8_0 --cache-type-v q4_0 asymmetric KV: bigger gains where it...

Gemma 4 26B on a consumer GPU: build pain, throughput, and BFCL numbers

Related Articles

It's Not Just X. It's Y

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy

SpaceX not the behemoth everyone thought