# What actually runs well on a 16 GB MacBook

Prasad Khake, Jun 1, 2026

Almost every LLM benchmark you read runs on a datacenter GPU. That tells you nothing about the machine actually on your desk. So I measured it: which models run well on a MacBook Air 15-inch (M3, 16 GB), a mainstream, mid-range Mac, and where it falls over.
Short version: a 16 GB Mac is a genuinely useful local-LLM machine up to about 8B parameters. Past that, it hits a wall, and the wall isn't subtle.
## The numbers
4-bit models via MLX, 256 tokens generated, measured on the machine itself (MacBook Air 15″, M3, 16 GB, macOS 26.5):
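| Model | Gen tokens/sec | Peak RAM |
| --- | --- | --- |
| Llama-3.2-1B | 38.7 | 0.8 GB |
| Phi-3.5-mini (3.8B) | 10.6 | 2.5 GB |
| Qwen3-4B | 10.8 | 2.4 GB |
| Qwen3.5-4B | 9.8 | 2.6 GB |
| Falcon3-7B | 5.7 | 4.3 GB |
| Llama-3.1-8B | 5.1 | 4.7 GB |
| Qwen3.5-9B | did not finish | n/a |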
Figure: Generation speed by model, MacBook Air 15″ (M3, 16 GB). The 9B never finishes: it tips into swap and crawls.
The shape is clean:
- **1B flies** (~40 tok/s): faster than you can read, and it uses under a gigabyte.
- **4B-class is the sweet spot**: ~10 tok/s, ~2.5 GB. Comfortably conversational, and it leaves plenty of room for your actual work.
- **7–8B is the practical edge**: ~5 tok/s. Usable for non-interactive tasks (summaries, drafts), a little slow for live chat.
- **9B is over the line.**
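To put those speeds in terms of waiting time, here is a small sketch that converts the table's tokens/sec into seconds for a 256-token response (the length used in the benchmark). It assumes speed stays constant over the whole response, which is a simplification, and the function name is only illustrative.

```python
# Generation speeds (tokens/sec) copied from the table above.
GEN_TOK_PER_SEC = {
    "Llama-3.2-1B": 38.7,
    "Phi-3.5-mini (3.8B)": 10.6,
    "Qwen3-4B": 10.8,
    "Qwen3.5-4B": 9.8,
    "Falcon3-7B": 5.7,
    "Llama-3.1-8B": 5.1,
}


def seconds_for(tokens, tok_per_sec):
    """Wall-clock seconds to generate `tokens` at a constant speed."""
    return tokens / tok_per_sec


for name, speed in GEN_TOK_PER_SEC.items():
    print(f"{name:22s} {seconds_for(256, speed):5.1f} s")
```

At these rates the 1B finishes a 256-token reply in about 7 seconds, the 4B-class models in roughly 24 to 26 seconds, and the 8B in about 50 seconds, which is why the 7–8B tier feels fine for batch work but slow for chat.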
## The 16 GB wall
The 9B didn’t just run slowly — it never finished a 256-token response in five minutes. Not because the model is huge (a 9B at 4-bit is only ~5–6 GB of weights), but because of what else is using your RAM.
On a 16 GB Mac doing real work, macOS takes ~4 GB, and an editor plus a browser easily take another 6–8 GB. That leaves ~4–6 GB for a model. An 8B (peak ~4.7 GB) just fits. A 9B needs a bit more than you have — so macOS starts paging the model’s weights to SSD, and generation slows to a crawl as it reads them back token by token.
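The arithmetic behind that paragraph can be sketched in a few lines. The OS and app figures are the rough ones from the text; the 4.5 bits per weight (4-bit values plus per-group scale overhead) is my assumption and depends on the quantization settings, and 5.5 GB for the 9B is simply the midpoint of the ~5–6 GB estimate, since its real peak was never recorded.

```python
def weights_gb(params_billion, bits_per_weight=4.5):
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.5 is an assumed effective rate for 4-bit quantization
    including per-group scales; the true overhead varies by settings.
    """
    return params_billion * bits_per_weight / 8


def ram_left_gb(total_gb=16.0, os_gb=4.0, apps_gb=7.0):
    """RAM left for a model after macOS and open apps take their share."""
    return total_gb - os_gb - apps_gb


def fits(model_gb, **budget):
    """True if the model's footprint is within the remaining RAM."""
    return model_gb <= ram_left_gb(**budget)


print(weights_gb(9))                                       # 9B weights only
print(ram_left_gb(apps_gb=6.0), ram_left_gb(apps_gb=8.0))  # light vs heavy apps
print(fits(4.7))  # 8B, measured peak from the table
print(fits(5.5))  # 9B, assumed footprint (midpoint of ~5-6 GB)
```

With the default 7 GB of apps the budget is 5 GB: the 8B's 4.7 GB peak squeezes in and the assumed 9B footprint does not. Push apps to 8 GB and even the 8B stops fitting, which is what "just fits" means in practice.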
I confirmed this wasn’t a fluke: the 9B failed to finish in two independent runs, including one that started with 67% of RAM free. It might fit on a freshly-rebooted machine with nothing else open — but nobody reboots their laptop to chat with a model. Under the conditions you’ll actually use it, 8B is the ceiling.
The deeper point: on 16 GB, peak RAM matters more than tokens/sec. The speed differences between a 4B and an 8B are tolerable; the difference between “fits” and “swaps” is the difference between usable and useless.
## Three things that almost gave me wrong numbers
Benchmarking on a laptop is easy to get wrong. Three traps I hit (all now handled in the tool):
- **Cold-start.** The very first generation in a process pays a one-time Metal kernel-compilation cost. My first 1B number came in at 33 tok/s; with a throwaway warmup generation first, it was 44. Always warm up before timing.
- **The laptop sleeping mid-run.** I time wall-clock, and at one point the Mac went to sleep between models, which showed up as a model taking 460 seconds to load. It was napping. Run benchmarks under caffeinate so the machine can't idle-sleep.
- **Memory accumulating across models.** Running all models in one process, MLX didn't fully release memory between them, so each later model looked slower than it was. The fix: run each model in its own subprocess, so the OS reclaims everything in between.
That last one is also why the tool gives each model a hard timeout — so one too-big model records a clean “did not finish” instead of hanging the whole run.
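The fixes above can be sketched roughly as follows. This is not the actual ondevice-bench code: `generate_fn` is a hypothetical callable standing in for whatever runs the model and returns the number of tokens produced, and both function names are made up for illustration.

```python
import subprocess
import sys
import time


def tokens_per_sec(generate_fn, prompt, max_tokens=256, warmup_tokens=8):
    """Time one generation after an untimed warmup call.

    generate_fn(prompt, max_tokens) is a hypothetical callable that runs the
    model and returns the number of tokens it actually produced.
    """
    generate_fn(prompt, warmup_tokens)  # throwaway: pays one-time startup costs
    start = time.perf_counter()
    produced = generate_fn(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed


def run_isolated(cmd, timeout_s=300.0):
    """Run one model's benchmark command in its own process, with a hard timeout.

    Returns (status, stdout); status is "ok", "failed", or "did not finish".
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so the run can move on.
        return "did not finish", ""
    return ("ok" if proc.returncode == 0 else "failed"), proc.stdout


if __name__ == "__main__":
    # Stand-in for a per-model benchmark script; replace with a real command.
    print(run_isolated([sys.executable, "-c", "print('done')"], timeout_s=30))
```

Because each model runs in a fresh process, its memory goes back to the OS when the process exits, and a model that swaps forever is recorded as "did not finish" after the timeout instead of stalling the rest. For the sleep trap, the whole harness can be launched under `caffeinate -i`, for example `caffeinate -i python bench.py`, where `bench.py` is a hypothetical script name.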
## So what should you run on a 16 GB Mac?
- **Want it snappy and out of the way?** A 4B (Qwen3-4B, Phi-3.5). ~10 tok/s, 2.5 GB, barely touches your headroom.
- **Want the most capable model that still fits?** An 8B (Llama-3.1-8B). ~5 tok/s, and you'll want to keep other apps light.
- **Eyeing a 9B+?** Either get 24 GB+, or accept that you'll be closing everything else first.
The tool that produced these numbers is open source: ondevice-bench — point it at your own machine and models.
I’m Prasad Khake — I make LLMs run well on real, on-device hardware, and build the products around them. More measurements like this in On Device.