Qwen 3.6 27B is the sweet spot for local development

stared2 pts0 comments

Qwen 3.6 27B is the sweet spot for local development - Quesma Blog

Back to Blog

Qwen 3.6 27B is the sweet spot for local development<br>Piotr Migdał 29 June 2026

Download PNG

I’ve been disappointed by local models in the past. But then I checked Qwen 3.6, and I was in awe.<br>For me it’s the first local model that actually makes sense as a general intelligence.

It comes in two variants, a mixture-of-experts model Qwen 3.6 35B A3B, and a dense Qwen 3.6 27B - slower, but more powerful. The one I recommend!

Let me share my impressions, and show that you can run it too.

It’s hot, literally. When my knees started to melt, I grabbed a phone-attached thermal camera and took a photo.

Qwen 3.6, rightfully, got a lot of coverage on Hacker News. The most common statement about Qwen 3.6 27B is that it punches above its weight - see Will it Mythos?. And I think it is a well-deserved sentiment.<br>It will make your computer hot, but it’s worth it!

Testing the waters

Simon Willison uses “penguins on a bicycle” as a smoke test (see for Qwen 3.6 35B A3B and then Qwen 3.6 27B). I usually go with constrained writing.

A year ago these kinds of things were state of the art, needing a unique, and insanely expensive GPT-4.5, see vibe<br>translating Quantum Flytrap.

I also asked it to write an 8 line poem about Zouk dance and quantum physics, see the transcript.<br>The thought process made sense, both in terms of deliberation on quantum terms, and rhymes.

Then I asked in OpenCode to create a hexagonal minesweeper using pnpm. It worked:

It worked on the first go, from a single prompt, with a proper Node package.<br>The mixture-of-experts Qwen 3.6 35B A3B was faster… but ignored my instruction to create a package, and did it in a single index.html.

Real work

Sure, creative writing about quantum mechanics, or yet another clone of a minesweeper, is rarely a day job.<br>But Qwen 3.6 27B is decent at regular tasks as well.

Prompt by a friend, Maciej Cielecki, at AI Tinkerers Warsaw.

It worked for a few minutes and created this:

By standards of current frontier models, it’s unremarkable.<br>But it is already a practical job. It worked, was reactive, defaults were nice - all from a single, short prompt.

Running Qwen 3.6 locally with llama.cpp

Running local models is easier than ever. A few CLI lines and you’re off.

I recommend llama.cpp - a direct, open source tool that allows running models on various devices. You don’t need Ollama, and frankly - I would recommend against using that on ethical grounds.

First, we go to Hugging Face, to get proper quantization, i.e. a model with reduced size - popular ones are by unsloth or bartowski, among others.<br>Default models usually come with BF16 precision. A common 8-bit quantization saves half the space at almost no cost to quality. Going further down the road, models are smaller (and potentially - faster), but at the cost of quality, see this comparison for 27B and another one for 35B A3B.

We grab unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0, an 8-bit quantization with support for multi-token prediction (MTP).

llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \<br>--spec-type draft-mtp -ngl 999 -fa on -c 65536 --jinja --port 8080<br>What it does:

-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 grabs from Hugging Face, on the next runs will reuse that

-m ~/models/Qwen3.6-27B-Q8_0.gguf use instead if you already have it

draft-mtp we use a fast model to predict subsequent tokens, speeds up things

-ngl 999 for putting all layers to GPU

-fa on flash attention is on

-c 65536 context size set to 64k tokens (this we can tweak, as Qwen 3.6 27B native context is 256k)

--jinja for tool calling support

--port 8080 better to pin port, as it will be used by other configs

If you open http://127.0.0.1:8080, you can directly chat with it.

Precisely the same server can be used for vibe coding. Choice of agent depends both on one’s goal and subjective taste - for an all-around OpenCode, minimalistic Pi, and self-improving Hermes.

For OpenCode, it is as simple as adding to ~/.config/opencode/opencode.jsonc:

"$schema": "https://opencode.ai/config.json",<br>"provider": {<br>"llama": {<br>"name": "llama.cpp (local)",<br>"npm": "@ai-sdk/openai-compatible",<br>"options": {<br>"baseURL": "http://127.0.0.1:8080/v1",<br>"apiKey": "local"<br>},<br>"models": {<br>"qwen3.6-27b": { "name": "Qwen3.6-27B Q8 +MTP" }<br>},<br>"model": "llama/qwen3.6-27b"<br>If you just want to chat and are a big fan of Terminal, instead of llama-server use llama-cli:

llama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \<br>-ngl 999 -fa on -c 65536 --jinja<br>Measuring performance

Is it fast enough?

I ran a few tests (source is here) on my Macbook Max M5 128 GB, running it with and without multi-token prediction, and comparing both with the 35B A3B model, and also a quantized DeepSeek V4 Flash version DwarfStar4.

tokens / s<br>RAM

Qwen3.6-35B-A3B · 8-bit<br>MLX<br>85 tok/s 85

37 GB RAM 37 GB

llama.cpp<br>93 tok/s 93

44 GB RAM 44 GB

llama.cpp + MTP<br>105 tok/s 105

45 GB RAM 45 GB

Qwen3.6-27B · 8-bit<br>MLX<br>17 tok/s...

qwen llama qwen3 local models model

Related Articles