I tested 8 LLM models on Linux without using the GPU


Can You Run LLMs Locally Without a GPU? I Tested 8 Models on Linux


Want to run AI models locally without expensive hardware? I tested 8 LLMs on a CPU-only machine to find out what works and what doesn’t.

Bhuwan Mishra

15 May 2026 · 10 min read


For the longest time, I assumed running LLMs locally needed a decent GPU. That’s what most guides implied, and honestly, that’s how the ecosystem felt not too long ago. But after digging into recent tools and actually trying things out on CPU-only setups, that assumption doesn’t really hold anymore.

Newer model formats like GGUF and aggressive quantization (think 4-bit variants) have made these models much smaller and lighter. At the same time, runtimes such as llama.cpp have become efficient enough that CPUs (yes, even older ones) can run them without completely falling apart.

That said, I quickly realized something more important: just because a model runs doesn’t mean it’s usable.

While testing, I found that the metric that really matters isn’t model size or even RAM usage; it’s tokens per second. A model that responds at 3–5 tokens per second technically works, but it feels painfully slow in practice. Once you get into the 15–30 tok/s range, things start to feel responsive enough for everyday use.

So instead of just listing models that can run on CPU, I focused on ones that are actually usable on low-end machines. This list is based on my own experimentation.

If you're working with an older laptop, a Raspberry Pi, or a basic desktop, this guide should help you run a local AI model at a usable speed.

What “Runs well on CPU” actually means

CPU performance varies wildly depending on model size and quantization. Formats used by tools like llama.cpp let you run models in reduced precision. Q8 offers better quality but is slower, while Q4_K is much faster at the cost of slightly reduced quality.

I saw speeds ranging from ~40+ tokens/sec for tiny models all the way down to ~4 tokens/sec for larger 4B models. That gap completely changes how usable a model feels.

In my experience, 1B–2B models consistently offer the best balance. They're small enough to fit comfortably within 8 GB of RAM (with quantization) and still maintain decent token speeds. They can also handle basic reasoning and produce useful responses.

Q4_K_M quantization usually hits the sweet spot. It provides fast response times, uses little RAM, and produces acceptable output quality for most tasks. Compared with higher-precision variants, it significantly improves tokens per second, sometimes enough to move a model from painfully slow to actually usable.

The hardware I'm testing on

I'm running these tests on a laptop with an Intel i5 CPU and around 12 GB of RAM. This is not a workstation or anything close to “AI-ready” hardware. It's a fairly typical older laptop, the kind many Linux users already have lying around.

Though the device comes with an integrated Intel UHD Graphics 620 GPU, it is irrelevant for LLMs here. While some tools experiment with iGPU acceleration, in practice all meaningful inference in my tests is CPU-bound.

I deliberately stuck to this machine because it reflects a realistic baseline. If something runs well here, it will likely run on older laptops and low-end desktops without any upgrades.

With around 12 GB of RAM, 3B–4B models fit comfortably (especially with Q4 quantization). Anything beyond that requires compromises, such as relying on swap, which slows things down.

While testing, I kept asking: would I actually use this daily on this machine? If a model felt sluggish, I treated it as impractical. If it responded smoothly, even at a smaller size, it made the cut.

Quick reality table

| Model | Eval Rate | Disk Size |
|---|---|---|
| Qwen 3 0.6B | ~34–36 tok/s | ~500 MB |
| TinyLlama 1.1B | ~25–28 tok/s | ~638 MB |
| Gemma 3 1B | ~18.6 tok/s | ~815 MB |
| Gemma 4 E2B | ~9.9 tok/s | ~7 GB |
| Granite 4 3B | ~8.5–9 tok/s | ~2 GB |
| Phi 4 Mini 3.8B | ~6.90 tok/s | ~2.5 GB |
| OpenHermes 7B | ~4.1–4.3 tok/s | ~4.1 GB |
| Ministral 3 8B | ~3.16 tok/s | ~6 GB |
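To make the eval rates in the table more tangible, here is a small sketch that converts tok/s into the time you would wait for one answer. The 200-token response length is an assumption for illustration, the lower bound is used where the table gives a range, and the estimate ignores model load and prompt processing time.

```python
# Eval rates (tok/s) copied from the table above.
# Where the table gives a range, the lower bound is used.
EVAL_RATES = {
    "Qwen 3 0.6B": 34.0,
    "TinyLlama 1.1B": 25.0,
    "Gemma 3 1B": 18.6,
    "Gemma 4 E2B": 9.9,
    "Granite 4 3B": 8.5,
    "Phi 4 Mini 3.8B": 6.9,
    "OpenHermes 7B": 4.1,
    "Ministral 3 8B": 3.16,
}

# Assumption: a medium-length answer of about 200 tokens.
RESPONSE_TOKENS = 200


def seconds_to_generate(tokens: int, tok_per_sec: float) -> float:
    """Time to stream `tokens` tokens at a steady eval rate.

    Ignores model load time and prompt processing.
    """
    if tok_per_sec <= 0:
        raise ValueError("tok_per_sec must be positive")
    return tokens / tok_per_sec


def wait_times(tokens: int = RESPONSE_TOKENS) -> dict:
    """Estimated wait in seconds per model for a response of `tokens` tokens."""
    return {model: seconds_to_generate(tokens, rate) for model, rate in EVAL_RATES.items()}


if __name__ == "__main__":
    for model, secs in wait_times().items():
        print(f"{model:<16} {secs:6.1f} s")
```

Under these assumptions, a 200-token answer takes roughly 6 seconds on Qwen 3 0.6B and over a minute on Ministral 3 8B, which is the gap between "feels responsive" and "feels painfully slow".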

8 LLMs that actually make sense on CPU

Let's dive into the LLMs. I used Ollama for this setup.

Qwen 3 0.6B

I started with Qwen 3 0.6B, mainly to establish a baseline for how fast a tiny model can run on a CPU. Qwen models are known for being efficient, and this 0.6B variant is about as lightweight as it gets while still being usable.

To run it locally, I used this Ollama command:

    ollama run qwen3:0.6b --verbose

The --verbose flag exposes detailed metrics like token evaluation rate, total duration, and prompt processing speed. I only used it for this initial run to get a clearer picture of performance.

The results were honestly impressive. I consistently saw an eval rate of ~34–36 tokens/sec. In practical terms, this feels instant. Responses stream smoothly without noticeable delay.

Of course, this comes with tradeoffs. The...
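The eval rate that --verbose prints can also be collected programmatically, which helps when comparing several models. The sketch below is not the method used in this article. It assumes an Ollama server running locally on its default port (11434) and uses the `eval_count` and `eval_duration` fields from the `/api/generate` response. The prompt text is a placeholder.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def eval_rate(stats: dict) -> float:
    """Tokens per second from Ollama's response stats.

    eval_count is the number of generated tokens;
    eval_duration is reported in nanoseconds.
    """
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming generate request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder prompt; any short question works for a quick measurement.
    stats = generate("qwen3:0.6b", "Explain what a token is in one sentence.")
    print(f"eval rate: {eval_rate(stats):.2f} tok/s")
```

Keeping the rate calculation in its own function makes it easy to average over several runs, since a single measurement on a laptop can vary with thermal throttling and background load.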
