Running Qwen 3.6 Locally on a Mac Mini M4 with 16GB RAM

Running Qwen 3.6 Locally on a Mac Mini M4 with 16GB RAM | Narek Maloyan Blog

Two days ago Qwen open-sourced Qwen 3.6-35B-A3B — a 35-billion parameter Mixture of Experts model that only activates 3 billion parameters per token. It's Apache 2.0 licensed, ships with a vision encoder, and is reportedly competitive with much larger models on agentic coding benchmarks. GGUF quantizations were up within hours.

Here's the thing: you can run it on a $599 Mac Mini M4 with 16GB of RAM. Not a toy demo — actual usable inference at 17 tok/s, zero swap, 81% memory free. This post is about how to do that, and which tools give you the best experience.

Why 35B-A3B works on 16GB

The naive math says it shouldn't fit. The standard formula for estimating model memory (from BentoML):

Memory (GB) = Parameters (B) × (Bits per weight / 8) × 1.2 overhead 35 × 4 / 8 × 1.2 = ~21GB

21GB for a Q4 quantization. That doesn't fit in 16GB. So how does it work?

The key is the MoE architecture. "35B-A3B" means 35 billion total parameters, but only 3 billion a ctive per token. The model uses 256 total experts with 8 routed + 1 shared active per inference step. The remaining experts sit idle. This is what makes the --mmap trick possible: llama.cpp memory-maps the model file, and the OS only pages in the weights for the currently active experts. Since the hot working set is roughly 3B parameters (~2GB at Q4), it fits comfortably in 16GB with room to spare.

Jock.pl benchmarked this on a Mac Mini M4 16GB: 17.3 tok/s decode, 81% memory free, zero swap. That's not hypothetical — it's a real measurement on the base model Mac Mini.

Why this matters: On benchmarks, the 35B-A3B architecture beats dense models up to 120B on coding and reasoning tasks, while running at the latency of a 3B model. On 16GB RAM. For $0/month. That's the pitch.

Picking your inference tool

There are four main ways to run LLMs locally on a Mac. Here's how they compare for running the 35B-A3B on 16GB specifically:

Tool Ease of setup 35B-A3B on 16GB? Tool calling Notes

llama.cpp Build from source Yes (mmap) Yes The way to do it on 16GB

Ollama One command Yes (uses llama.cpp) Yes MLX backend requires 32GB+

LM Studio GUI app Yes (MLX or GGUF) Yes MLX on 16GB; nice UI

MLX / mlx-lm pip install Tight fit No* Fastest raw speed; no tool calling yet

*mlx-vlm has a PR in progress for tool calling support.

One important detail: Ollama 0.19 (released March 30, 2026) shipped an MLX backend that nearly doubles decode speed — from 58 tok/s to 112 tok/s. But it requires 32GB+ unified memory. On 16GB, Ollama falls back to the llama.cpp backend. Still works, just not the fast path. LM Studio doesn't have this gate and can use MLX on 16GB, which is a real advantage.

Setup 1: llama.cpp with mmap (recommended)

This is the most reliable way to run the 35B-A3B on 16GB. Metal GPU acceleration is enabled by default on macOS — no flags needed.

# Build llama.cpp git clone https://github.com/ggml-org/llama.cpp cd llama.cpp cmake -B build cmake --build build --config Release

# Download the Qwen 3.6 GGUF (Q4_K_M quantization) pip install huggingface_hub huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \ Qwen3.6-35B-A3B-Q4_K_M.gguf \ --local-dir models/

# Run with mmap — this is the key flag ./build/bin/llama-cli \ -m models/Qwen3.6-35B-A3B-Q4_K_M.gguf \ --mmap \ -c 4096 \ -n 512 \ -p "Write a FastAPI endpoint with input validation"

What's happening: --mmap tells llama.cpp to memory-map the model file instead of loading it all into RAM. The OS pages in weights on demand. Because only ~3B parameters are active per token, the actual resident memory stays well under 16GB. The rest of the 21GB model file lives on your SSD and gets paged in only when an expert is activated.

You can also run it as an OpenAI-compatible API server for use with coding agents:

# Start the server ./build/bin/llama-server \ -m models/Qwen3.6-35B-A3B-Q4_K_M.gguf \ --mmap \ -c 4096 \ --port 8080

# Now any tool that speaks the OpenAI API can use it: # aider, opencode, aichat, etc. → http://localhost:8080/v1

GGUF files are also available from bartowski if you prefer different quantization levels.

Setup 2: Ollama (easiest)

If you don't want to build anything from source, Ollama handles everything — download, quantization, API server — in one command. Under the hood it uses llama.cpp, so mmap works the same way.

# Install Ollama, then: ollama run qwen3.6:35b-a3b

That's it. Ollama downloads the GGUF, picks Q4_K_M by default, and starts an OpenAI-compatible API at http://localhost:11434. You can connect coding agents directly:

# Launch with opencode ollama launch opencode --model qwen3.6:35b-a3b

# Or with OpenClaw ollama launch openclaw --model qwen3.6:35b-a3b

Ollama exposes models to anything that speaks the OpenAI API format — aichat, aider, opencode, and many others. Point them at http://localhost:11434/v1.

On 16GB you'll get roughly the same 17 tok/s as raw llama.cpp. The MLX-accelerated path (which...

Running Qwen 3.6 Locally on a Mac Mini M4 with 16GB RAM

Related Articles

(no title)

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

ZCode – Harness for GLM-5.2

Apertus – Open Foundation Model for Sovereign AI