Qwen 3.6 27B is the sweet spot for local development - Quesma Blog
Back to Blog
Qwen 3.6 27B is the sweet spot for local development<br>Piotr Migdał 29 June 2026
Download PNG
I’ve been disappointed by local models in the past. But then I checked Qwen 3.6, and I was in awe.<br>For me it’s the first local model that actually makes sense as a general intelligence.
It comes in two variants, a mixture-of-experts model Qwen 3.6 35B A3B, and a dense Qwen 3.6 27B - slower, but more powerful. The one I recommend!
Let me share my impressions, and show that you can run it too.
It’s hot, literally. When my knees started to melt, I grabbed a phone-attached thermal camera and took a photo.
Qwen 3.6, rightfully, got a lot of coverage on Hacker News. The most common statement about Qwen 3.6 27B is that it punches above its weight - see Will it Mythos?. And I think it is a well-deserved sentiment.<br>It will make your computer hot, but it’s worth it!
Testing the waters
Simon Willison uses “penguins on a bicycle” as a smoke test (see for Qwen 3.6 35B A3B and then Qwen 3.6 27B). I usually go with constrained writing.
A year ago these kinds of things were state of the art, needing a unique, and insanely expensive GPT-4.5, see vibe<br>translating Quantum Flytrap.
I also asked it to write an 8 line poem about Zouk dance and quantum physics, see the transcript.<br>The thought process made sense, both in terms of deliberation on quantum terms, and rhymes.
Then I asked in OpenCode to create a hexagonal minesweeper using pnpm. It worked:
It worked on the first go, from a single prompt, with a proper Node package.<br>The mixture-of-experts Qwen 3.6 35B A3B was faster… but ignored my instruction to create a package, and did it in a single index.html.
Real work
Sure, creative writing about quantum mechanics, or yet another clone of a minesweeper, is rarely a day job.<br>But Qwen 3.6 27B is decent at regular tasks as well.
Prompt by a friend, Maciej Cielecki, at AI Tinkerers Warsaw.
It worked for a few minutes and created this:
By standards of current frontier models, it’s unremarkable.<br>But it is already a practical job. It worked, was reactive, defaults were nice - all from a single, short prompt.
Running Qwen 3.6 locally with llama.cpp
Running local models is easier than ever. A few CLI lines and you’re off.
I recommend llama.cpp - a direct, open source tool that allows running models on various devices. You don’t need Ollama, and frankly - I would recommend against using that on ethical grounds.
First, we go to Hugging Face, to get proper quantization, i.e. a model with reduced size - popular ones are by unsloth or bartowski, among others.<br>Default models usually come with BF16 precision. A common 8-bit quantization saves half the space at almost no cost to quality. Going further down the road, models are smaller (and potentially - faster), but at the cost of quality, see this comparison for 27B and another one for 35B A3B.
We grab unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0, an 8-bit quantization with support for multi-token prediction (MTP).
llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \<br>--spec-type draft-mtp -ngl 999 -fa on -c 65536 --jinja --port 8080<br>What it does:
-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 grabs from Hugging Face, on the next runs will reuse that
-m ~/models/Qwen3.6-27B-Q8_0.gguf use instead if you already have it
draft-mtp we use a fast model to predict subsequent tokens, speeds up things
-ngl 999 for putting all layers to GPU
-fa on flash attention is on
-c 65536 context size set to 64k tokens (this we can tweak, as Qwen 3.6 27B native context is 256k)
--jinja for tool calling support
--port 8080 better to pin port, as it will be used by other configs
If you open http://127.0.0.1:8080, you can directly chat with it.
Precisely the same server can be used for vibe coding. Choice of agent depends both on one’s goal and subjective taste - for an all-around OpenCode, minimalistic Pi, and self-improving Hermes.
For OpenCode, it is as simple as adding to ~/.config/opencode/opencode.jsonc:
"$schema": "https://opencode.ai/config.json",<br>"provider": {<br>"llama": {<br>"name": "llama.cpp (local)",<br>"npm": "@ai-sdk/openai-compatible",<br>"options": {<br>"baseURL": "http://127.0.0.1:8080/v1",<br>"apiKey": "local"<br>},<br>"models": {<br>"qwen3.6-27b": { "name": "Qwen3.6-27B Q8 +MTP" }<br>},<br>"model": "llama/qwen3.6-27b"<br>If you just want to chat and are a big fan of Terminal, instead of llama-server use llama-cli:
llama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \<br>-ngl 999 -fa on -c 65536 --jinja<br>Measuring performance
Is it fast enough?
I ran a few tests (source is here) on my Macbook Max M5 128 GB, running it with and without multi-token prediction, and comparing both with the 35B A3B model, and also a quantized DeepSeek V4 Flash version DwarfStar4.
tokens / s<br>RAM
Qwen3.6-35B-A3B · 8-bit<br>MLX<br>85 tok/s 85
37 GB RAM 37 GB
llama.cpp<br>93 tok/s 93
44 GB RAM 44 GB
llama.cpp + MTP<br>105 tok/s 105
45 GB RAM 45 GB
Qwen3.6-27B · 8-bit<br>MLX<br>17 tok/s...