I tried running a few local LLM models on my MacBook Air M3

# What actually runs well on a 16 GB MacBook

Prasad Khake, Jun 1, 2026

Almost every LLM benchmark you read runs on a datacenter GPU. That tells you nothing about the machine actually on your desk. So I measured it: which models run well on a MacBook Air 15-inch (M3, 16 GB), a mainstream, mid-range Mac, and where it falls over.

Short version: a 16 GB Mac is a genuinely useful local-LLM machine up to about 8B parameters. Past that, it hits a wall, and the wall isn’t subtle.

## The numbers

4-bit models via MLX, 256 tokens generated, measured on the machine itself (MacBook Air 15″, M3, 16 GB, macOS 26.5):

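| Model | Gen tokens/sec | Peak RAM |
|---|---|---|
| Llama-3.2-1B | 38.7 | 0.8 GB |
| Phi-3.5-mini (3.8B) | 10.6 | 2.5 GB |
| Qwen3-4B | 10.8 | 2.4 GB |
| Qwen3.5-4B | 9.8 | 2.6 GB |
| Falcon3-7B | 5.7 | 4.3 GB |
| Llama-3.1-8B | 5.1 | 4.7 GB |
| Qwen3.5-9B | did not finish | n/a |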

Generation speed by model — MacBook Air 15″ (M3, 16 GB). The 9B never finishes: it tips into swap and crawls.

The shape is clean:

- **1B flies** (~40 tok/s): faster than you can read, and it uses under a gigabyte.
- **4B-class is the sweet spot**: ~10 tok/s, ~2.5 GB. Comfortably conversational, leaves plenty of room for your actual work.
- **7–8B is the practical edge**: ~5 tok/s. Usable for non-interactive tasks (summaries, drafts), a little slow for live chat.
- **9B is over the line.**
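To make those rates concrete, here is a minimal sketch that converts the measured generation speeds into the wait for a 256-token reply. The figures are copied from the table. The helper name `seconds_for_reply` is made up for illustration, and the calculation ignores model load and prompt processing time.

```python
# Generation speeds (tokens/sec) copied from the table above.
GEN_TOK_PER_SEC = {
    "Llama-3.2-1B": 38.7,
    "Phi-3.5-mini": 10.6,
    "Qwen3-4B": 10.8,
    "Qwen3.5-4B": 9.8,
    "Falcon3-7B": 5.7,
    "Llama-3.1-8B": 5.1,
}


def seconds_for_reply(tok_per_sec: float, n_tokens: int = 256) -> float:
    """Seconds needed to generate n_tokens at a steady rate.

    Simplification: load time and prompt processing are not included.
    """
    return n_tokens / tok_per_sec


if __name__ == "__main__":
    for name, rate in GEN_TOK_PER_SEC.items():
        print(f"{name:14s} {seconds_for_reply(rate):5.1f} s for 256 tokens")
```

At these rates the 1B answers in under 7 seconds, the 4B-class models in roughly 25, and the 8B in about 50, which is why the 8B feels fine for drafts but slow for chat.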

## The 16 GB wall

The 9B didn’t just run slowly — it never finished a 256-token response in five minutes. Not because the model is huge (a 9B at 4-bit is only ~5–6 GB of weights), but because of what else is using your RAM.
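The ~5–6 GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a grouped 4-bit scheme where every 64 weights share one 16-bit scale and one 16-bit bias, so about 4.5 bits per weight. I believe that matches MLX's default quantization settings, but treat the group size and overhead here as assumptions; real checkpoints also keep some layers at higher precision, so actual sizes run a little larger.

```python
def quantized_weight_gb(
    n_params: float,
    bits: int = 4,
    group_size: int = 64,             # assumed quantization group size
    overhead_bits_per_group: int = 32,  # assumed: fp16 scale + fp16 bias
) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes)."""
    bits_per_weight = bits + overhead_bits_per_group / group_size
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for billions in (1, 4, 8, 9):
        size = quantized_weight_gb(billions * 1e9)
        print(f"{billions}B params -> ~{size:.2f} GB of weights")
```

Under these assumptions a 9B model comes out just over 5 GB of weights and an 8B at about 4.5 GB, before any KV cache or runtime overhead is counted.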

On a 16 GB Mac doing real work, macOS takes ~4 GB, and an editor plus a browser easily take another 6–8 GB. That leaves ~4–6 GB for a model. An 8B (peak ~4.7 GB) just fits. A 9B needs a bit more than you have — so macOS starts paging the model’s weights to SSD, and generation slows to a crawl as it reads them back token by token.
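The budget in that paragraph is simple subtraction, sketched here with the article's rough figures as defaults. The function names and the three verdict labels are illustrative, not part of any tool.

```python
def model_headroom_gb(total_gb=16.0, os_gb=4.0, apps_gb=(6.0, 8.0)):
    """Return (worst_case, best_case) GB left over for a model.

    apps_gb is a (light, heavy) range for editor-plus-browser usage.
    """
    light_apps, heavy_apps = apps_gb
    worst = total_gb - os_gb - heavy_apps
    best = total_gb - os_gb - light_apps
    return worst, best


def verdict(peak_gb, headroom):
    """Classify a model's peak RAM against the headroom range."""
    worst, best = headroom
    if peak_gb <= worst:
        return "fits"    # fits even with heavy apps open
    if peak_gb <= best:
        return "tight"   # fits only if other apps stay light
    return "swaps"       # exceeds even the best case


if __name__ == "__main__":
    room = model_headroom_gb()
    print("headroom (worst, best):", room)
    for name, peak in [("Qwen3-4B", 2.4), ("Llama-3.1-8B", 4.7)]:
        print(name, "->", verdict(peak, room))
```

With the defaults, the 4B-class models land in "fits" and the 8B in "tight", which lines up with the advice to keep other apps light when running an 8B.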

I confirmed this wasn’t a fluke: the 9B failed to finish in two independent runs, including one that started with 67% of RAM free. It might fit on a freshly-rebooted machine with nothing else open — but nobody reboots their laptop to chat with a model. Under the conditions you’ll actually use it, 8B is the ceiling.

The deeper point: on 16 GB, peak RAM matters more than tokens/sec. The speed differences between a 4B and an 8B are tolerable; the difference between “fits” and “swaps” is the difference between usable and useless.

## Three things that almost gave me wrong numbers

Benchmarking on a laptop is easy to get wrong. Three traps I hit (all now handled in the tool):

Cold-start. The very first generation in a process pays a one-time Metal kernel-compilation cost. My first 1B number came in at 33 tok/s; with a throwaway warmup generation first, it was 44. Always warm up before timing.
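A minimal sketch of the warm-up pattern, not the tool's actual code. Here `generate` is a hypothetical callable that produces up to `n` tokens and returns how many it produced; in practice it would wrap whatever generation call your runtime exposes.

```python
import time
from typing import Callable


def timed_tok_per_sec(
    generate: Callable[[int], int],
    n_tokens: int = 256,
    warmup_tokens: int = 8,
) -> float:
    """Tokens/sec for one generation, measured after a throwaway warm-up."""
    generate(warmup_tokens)  # absorbs one-time costs; result is discarded
    start = time.perf_counter()
    produced = generate(n_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed


if __name__ == "__main__":
    def fake_generate(n: int) -> int:
        """Stand-in generator: pretends each token takes 1 ms."""
        time.sleep(0.001 * n)
        return n

    print(f"{timed_tok_per_sec(fake_generate):.0f} tok/s")
```

Dividing by the number of tokens actually produced, instead of the number requested, keeps the rate honest when a model stops early.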

The laptop sleeping mid-run. I time wall-clock, and at one point the Mac went to sleep between models — which showed up as a model taking 460 seconds to load. It was napping. Run benchmarks under caffeinate so the machine can’t idle-sleep.
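One way to apply this from a Python driver is to prefix the benchmark command with `caffeinate -i`, which holds off idle sleep while the wrapped command runs. This is a sketch under that assumption; the helper name is made up, and on systems without `caffeinate` it returns the command unchanged.

```python
import shutil


def keep_awake_command(cmd: list[str]) -> list[str]:
    """Prefix cmd with `caffeinate -i` when the tool is on PATH (macOS)."""
    if shutil.which("caffeinate"):
        return ["caffeinate", "-i", *cmd]
    return list(cmd)  # other platforms: run the command as-is


if __name__ == "__main__":
    # "bench.py" is a placeholder script name for illustration only.
    print(keep_awake_command(["python3", "bench.py"]))
```

The resulting list can be passed straight to `subprocess.run`, so the sleep assertion lasts exactly as long as the benchmark does.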

Memory accumulating across models. Running all models in one process, MLX didn’t fully release memory between them, so each later model looked slower than it was. The fix: run each model in its own subprocess, so the OS reclaims everything in between.

That last one is also why the tool gives each model a hard timeout — so one too-big model records a clean “did not finish” instead of hanging the whole run.
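The subprocess-plus-timeout pattern can be sketched with the standard library alone. This is an illustration, not the tool's implementation: `run_isolated` is a made-up name, and the 300-second default simply mirrors the five-minute figure mentioned earlier.

```python
import subprocess
import sys


def run_isolated(cmd: list[str], timeout_s: float = 300.0) -> dict:
    """Run one benchmark in its own process and report a clean status.

    When the process exits, the OS reclaims all of its memory, so the
    next model starts from a clean slate.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing hangs.
        return {"status": "did not finish", "output": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "output": proc.stdout.strip()}


if __name__ == "__main__":
    quick = [sys.executable, "-c", "print('5.1 tok/s')"]
    stuck = [sys.executable, "-c", "import time; time.sleep(10)"]
    print(run_isolated(quick, timeout_s=30))
    print(run_isolated(stuck, timeout_s=1))
```

A driver would call this once per model and collect the returned dicts, so a model that swaps shows up as one "did not finish" row without blocking the rest of the run.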

## So what should you run on a 16 GB Mac?

Want it snappy and out of the way? A 4B (Qwen3-4B, Phi-3.5). ~10 tok/s, 2.5 GB, barely touches your headroom.

Want the most capable model that still fits? An 8B (Llama-3.1-8B). ~5 tok/s, and you’ll want to keep other apps light.

Eyeing a 9B+? Either get 24 GB+, or accept that you’ll be closing everything else first.

The tool that produced these numbers is open source: ondevice-bench — point it at your own machine and models.

I’m Prasad Khake — I make LLMs run well on real, on-device hardware, and build the products around them. More measurements like this in On Device.

