Qwen 3.6 27B is the sweet spot for local development

Qwen 3.6 27B is the sweet spot for local development - Quesma Blog

Back to Blog

Qwen 3.6 27B is the sweet spot for local development Piotr Migdał 29 June 2026

Download PNG

I’ve been disappointed by local models in the past. But then I checked Qwen 3.6, and I was in awe. For me it’s the first local model that actually makes sense as a general intelligence.

It comes in two variants, a mixture-of-experts model Qwen 3.6 35B A3B, and a dense Qwen 3.6 27B - slower, but more powerful. The one I recommend!

Let me share my impressions, and show that you can run it too.

It’s hot, literally. When my knees started to melt, I grabbed a phone-attached thermal camera and took a photo.

Qwen 3.6, rightfully, got a lot of coverage on Hacker News. The most common statement about Qwen 3.6 27B is that it punches above its weight - see Will it Mythos?. And I think it is a well-deserved sentiment. It will make your computer hot, but it’s worth it!

Testing the waters

Simon Willison uses “penguins on a bicycle” as a smoke test (see for Qwen 3.6 35B A3B and then Qwen 3.6 27B). I usually go with constrained writing.

A year ago these kinds of things were state of the art, needing a unique, and insanely expensive GPT-4.5, see vibe translating Quantum Flytrap.

I also asked it to write an 8 line poem about Zouk dance and quantum physics, see the transcript. The thought process made sense, both in terms of deliberation on quantum terms, and rhymes.

Then I asked in OpenCode to create a hexagonal minesweeper using pnpm. It worked:

It worked on the first go, from a single prompt, with a proper Node package. The mixture-of-experts Qwen 3.6 35B A3B was faster… but ignored my instruction to create a package, and did it in a single index.html.

Real work

Sure, creative writing about quantum mechanics, or yet another clone of a minesweeper, is rarely a day job. But Qwen 3.6 27B is decent at regular tasks as well.

Prompt by a friend, Maciej Cielecki, at AI Tinkerers Warsaw.

It worked for a few minutes and created this:

By standards of current frontier models, it’s unremarkable. But it is already a practical job. It worked, was reactive, defaults were nice - all from a single, short prompt.

Running Qwen 3.6 locally with llama.cpp

Running local models is easier than ever. A few CLI lines and you’re off.

I recommend llama.cpp - a direct, open source tool that allows running models on various devices. You don’t need Ollama, and frankly - I would recommend against using that on ethical grounds.

First, we go to Hugging Face, to get proper quantization, i.e. a model with reduced size - popular ones are by unsloth or bartowski, among others. Default models usually come with BF16 precision. A common 8-bit quantization saves half the space at almost no cost to quality. Going further down the road, models are smaller (and potentially - faster), but at the cost of quality, see this comparison for 27B and another one for 35B A3B.

We grab unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0, an 8-bit quantization with support for multi-token prediction (MTP).

llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \ --spec-type draft-mtp -ngl 999 -fa on -c 65536 --jinja --port 8080 What it does:

-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 grabs from Hugging Face, on the next runs will reuse that

-m ~/models/Qwen3.6-27B-Q8_0.gguf use instead if you already have it

draft-mtp we use a fast model to predict subsequent tokens, speeds up things

-ngl 999 for putting all layers to GPU

-fa on flash attention is on

-c 65536 context size set to 64k tokens (this we can tweak, as Qwen 3.6 27B native context is 256k)

--jinja for tool calling support

--port 8080 better to pin port, as it will be used by other configs

If you open http://127.0.0.1:8080, you can directly chat with it.

Precisely the same server can be used for vibe coding. Choice of agent depends both on one’s goal and subjective taste - for an all-around OpenCode, minimalistic Pi, and self-improving Hermes.

For OpenCode, it is as simple as adding to ~/.config/opencode/opencode.jsonc:

"$schema": "https://opencode.ai/config.json", "provider": { "llama": { "name": "llama.cpp (local)", "npm": "@ai-sdk/openai-compatible", "options": { "baseURL": "http://127.0.0.1:8080/v1", "apiKey": "local" }, "models": { "qwen3.6-27b": { "name": "Qwen3.6-27B Q8 +MTP" } }, "model": "llama/qwen3.6-27b" If you just want to chat and are a big fan of Terminal, instead of llama-server use llama-cli:

llama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \ -ngl 999 -fa on -c 65536 --jinja Measuring performance

Is it fast enough?

I ran a few tests (source is here) on my Macbook Max M5 128 GB, running it with and without multi-token prediction, and comparing both with the 35B A3B model, and also a quantized DeepSeek V4 Flash version DwarfStar4.

tokens / s RAM

Qwen3.6-35B-A3B · 8-bit MLX 85 tok/s 85

37 GB RAM 37 GB

llama.cpp 93 tok/s 93

44 GB RAM 44 GB

llama.cpp + MTP 105 tok/s 105

45 GB RAM 45 GB

Qwen3.6-27B · 8-bit MLX 17 tok/s...

Qwen 3.6 27B is the sweet spot for local development

Related Articles

(no title)

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

Italy's Meloni says Trump 'made up' story that she 'begged' him for photo at G7