Local LLMs on a Ryzen 8700G iGPU: 13-15 tok/s on gemma4, 9-12 on qwen3.6

Local LLMs on my TrueNAS, and the frontend I had to build - Loomcycle

§ field report

2026-06-28 · by Dennis Gubsky · ~17 min read

Three weekends ago I rebuilt my home server. The old box was a small TrueNAS NAS. The new one is the same role plus a local LLM inference machine, on an AMD Ryzen 7 8700G with 96 GB of DDR5. This post is the field log: which decisions actually mattered, where I lost time, and the moment I realized that once the hardware worked, the frontend layer was its own problem.

The hardware part is most of the words below. The frontend part is shorter but it's the reason this post lives on loomcycle.dev. I tried Open WebUI. I stopped using it after two days. The thing that's replacing it is a chat surface in a new loomboard repo, modelled on what Open WebUI's chat actually gets right and wired to the loomcycle substrate I already work in.

The short version. If you're planning a local-inference box: buy an APU, not a desktop chip with display graphics; the 8700G's Radeon 780M is the entry point, and the 2-CU iGPUs on regular Ryzen chips are useless for this. Memory bandwidth is the bottleneck, not core count, so DDR5-6000 CL30 EXPO and lots of capacity beat more cores. The 780M's gfx1103 architecture is not officially supported by ROCm, but HSA_OVERRIDE_GFX_VERSION=11.0.2 plus prebuilt gfx1103 Tensile kernels gets inference running 100% on the iGPU. Real-workload throughput on this box: 13-15 tok/s on gemma4:latest, 9-12 tok/s on qwen3.6:latest. GTT memory lets the iGPU address tens of gigabytes regardless of the BIOS UMA cap. A PPT cap at 65 W drops a 90°C inference load to under 60°C with no measurable speed loss, since inference is memory-bound. Open WebUI's chat surface is good, but the configuration UI is opaque and the substrate it can reach isn't the one I work in.
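The override only takes effect if it is in the environment of the process that loads ROCm. A minimal sketch of that idea, assuming the inference server is started as a child process: the helper names are hypothetical, and the probe command is a stand-in that just echoes the variable back, not a real server invocation.

```python
import os
import subprocess
import sys

# Value quoted in this post; the helper names below are illustrative only.
GFX_OVERRIDE = "11.0.2"


def with_gfx_override(base_env=None):
    """Return a copy of the environment with HSA_OVERRIDE_GFX_VERSION set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = GFX_OVERRIDE
    return env


def run_with_override(command):
    """Run `command` (a list of strings) with the override in its environment."""
    return subprocess.run(
        command,
        env=with_gfx_override(),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in child process: prints the variable it inherited.
    probe = [
        sys.executable,
        "-c",
        "import os; print(os.environ['HSA_OVERRIDE_GFX_VERSION'])",
    ]
    print(run_with_override(probe).stdout.strip())
```

In practice the same variable would go wherever the inference service gets its environment (a unit file, a container spec, a shell profile); the sketch only shows that it has to be set before the process starts, not after.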

The constraint: one box, three workloads, no DGX budget

This wasn't a clean-sheet build. I already had a small lab NAS running on an Intel N100 with 16 GB of DDR5. Fine as storage and a few small services, weak for everything else. Three workloads landed on me at the same time and I needed one box to host all of them:

VMs for product pre-release testing. JobEmber.ai (loomcycle's production consumer) plus a sibling SaaS in stealth pre-release, plus the smaller experimental instances those workflows spawn. Each wants its own VM so changes don't bleed into each other; a few GB and a couple of vCPUs apiece.

Loomcycle itself, running as a server. Not just the binary I push code into, but the multi-replica deployment I test against. Real memory and real cores.

Local LLM inference. The thing I'd been doing through cloud providers and wanted to pull in-house, partly for cost and partly because some of the work involves untrusted-input agentic loops I'd rather not run through someone else's API quota.

The straightforward shape for "I need local inference at home" is a discrete-GPU rig, an NVIDIA DGX Spark, or one of the soldered-RAM Mac Studios / Strix Halo boxes. The Spark is $4,500-5,500. A Mac Studio with serious unified memory sits in the same band. Strix Halo (Ryzen AI MAX) starts around EUR 4,000 / $5,000 in Europe and everything is soldered: you commit to a fixed RAM amount and a fixed iGPU at purchase. I didn't have a spare $4,500 for any of these, and I didn't want to lock the spec at the chip and RAM I happened to pick this year.

So the question stopped being "what's the best inference box" and became "what's the cheapest single box that hosts all three workloads, without soldering me into a corner I'd regret in twelve months."

That reframe was load-bearing. It ruled out the Spark on price. It ruled out Strix Halo on price AND rigidity. It ruled out a discrete-GPU build because the iGPU-plus-fast-system-RAM path is meaningfully cheaper for the model sizes I actually run, and a discrete card would have meant a bigger case, a bigger PSU, and a second thermal envelope to manage on a 24/7 box.

What was left: upgrade the existing NAS. AM5 socket, so the chip is socketed and I can swap it later. DDR5 in DIMMs, so I can add capacity or upgrade timing without rebuilding the system. An APU as the inference engine, because a single iGPU plus a generous pile of system RAM serves a memory-bandwidth-bound workload at the right price point. Total parts cost: ~EUR 2,100 (Ryzen 7 8700G, 96 GB DDR5-6000 CL30, motherboard, new PSU), roughly half the entry price of the rejected options. The three workloads coexist cleanly: the storage half is light enough that an 8-core APU handles it without strain when inference is idle, the VM workload sits in the middle, and the inference workload spikes the iGPU when it's needed.
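The memory-bandwidth argument can be put in back-of-envelope form. This is a rough sketch, not a benchmark: it assumes a dual-channel DDR5-6000 setup with a 64-bit bus per channel (the channel count is my assumption here), and it treats decode speed as capped by how many bytes of weights have to be streamed from RAM per token. The 5 GB figure in the example is hypothetical, not a measurement of either model above.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels=2, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * channels * bus_bytes / 1e9


def decode_ceiling_tok_s(bandwidth_gbs, gb_read_per_token):
    """Upper bound on tokens/s if each token streams gb_read_per_token from RAM."""
    return bandwidth_gbs / gb_read_per_token


def implied_gb_per_token(bandwidth_gbs, measured_tok_s):
    """Most data a token could be streaming, given a measured decode speed."""
    return bandwidth_gbs / measured_tok_s


if __name__ == "__main__":
    bw = peak_bandwidth_gbs(6000)           # DDR5-6000, dual channel (assumed)
    print(f"peak bandwidth: {bw:.1f} GB/s")
    # Hypothetical model that reads 5 GB of weights per generated token.
    print(f"ceiling: {decode_ceiling_tok_s(bw, 5.0):.1f} tok/s")
    # Working backwards from the low end of the measured range in this post.
    print(f"implied: {implied_gb_per_token(bw, 13):.1f} GB/token at most")
```

Real throughput lands below the ceiling because peak bandwidth is never fully achieved, but the shape of the formula is why faster memory moves the number and extra cores mostly don't.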

The upgrade path mattered more than the absolute spec. AMD's next-generation APU drops into this socket. If a future Ryzen APU ships with a 16-CU or 20-CU iGPU on a stronger architecture, the swap is one chip and one BIOS...
