Mlx-optiq: per-layer mixed-precision LLM quantization for Apple Silicon

Run LLMs locally on your Mac (Apple Silicon) · mlx-optiq

Optimizing compiler · MLX

Quantize, fine-tune

and serve LLMs

entirely on Apple Silicon.

Run large language models locally on your Mac, from M1 to M5. Per-layer sensitivity analysis for mixed-precision weights. LoRA fine-tuning that respects the bit budget. A server that speaks both OpenAI and Anthropic APIs (point Claude Code at your local quant). Send it an image, not just text, on any vision-capable model. No GPU cluster, no API key.

Per-layer bit allocation · sample LLM

Per-layer bit allocation across a 32-layer transformer: tall emerald bars are 8-bit protected layers, short warm-grey bars are 4-bit.

8-bit · sensitive layers 4-bit · robust layers

$ pip install mlx-optiq copy

3.1× avg compression vs bf16

+1.4× decode via MTP / drafter

+13.6 best Capability Score gain

140k+ HF downloads / month

OptIQ Lab — your Mac, your models, no cloud

OptIQ Lab · localhost:8080

A local LLM workbench, one pip install away: chat with sandboxed tools, compare models, fine-tune, and serve. Here a 4-bit OptIQ Qwen3.5 answers at 90 tok/s, fully offline .

01 Pre-built models

Drop-in 4-bit quants. Same weights, smarter bits.

Sixteen production mlx-optiq-quantized LLMs on Hugging Face. Nemotron 3, MiniCPM5, Qwen3.5, Qwen3.6 and Gemma-4 families, from 1 B dense to 35 B-A3B mixture-of-experts. They load directly into stock mlx-lm. No special runtime.

Gemma-4 · new gemma-4-12B-it-OptiQ-4bit

Google's unified text+vision Gemma-4, at 8.3 GB, with image input. Capability Score 68.2 (+6.4 vs uniform-4-bit), one of our largest mixed-precision gains, and the strongest model we ship under 9 GB on disk.

8.3 GB on disk 68.2 Capability +6.4 vs U4

Gemma-4 gemma-4-31B-it-OptiQ-4bit

The largest single quant we ship. 31 B parameters in 20.8 GB with Capability Score 79.7 (+3.5 vs uniform-4-bit). Pair with the matching -assistant-bf16 drafter for speculative decoding.

20.8 GB on disk 79.7 Capability +3.5 vs U4

Qwen3.6 Qwen3.6-27B-OptiQ-4bit

Frontier-class reasoning at 17.5 GB with our highest Capability Score (83.0). Bundled MTP head gives ~1.4× decode via optiq serve --mtp.

17.5 GB on disk 83.0 Capability +0.5 vs U4

Qwen3.5 Qwen3.5-9B-OptiQ-4bit

The default daily-driver. 9 B parameters in 6.6 GB. Capability Score 66.8 (+0.2 vs uniform-4-bit). Long context to 64 k via mixed-precision KV; bundled MTP head for speculative decoding.

6.6 GB on disk 66.8 Capability +0.2 vs U4

all 16 models →

02 Quickstart

From zero to a serving LLM in three commands.

Each step is reversible and works with stock MLX tools. mlx-optiq is additive. Skip any of these and you still have a working pipeline.

Install

Pure Python. Pulls in mlx, mlx-lm and huggingface-hub. Python 3.11+ on Apple Silicon.

terminalbash

$ pip install mlx-optiq

Use a pre-built quant

Pre-built mlx-optiq quants load with stock mlx-lm. Per-layer bit assignment is recorded in the model metadata. No special loader required.

generate.pypython

from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit") out = generate(model, tok, prompt="Explain mixed-precision quantization.", max_tokens=200) print(out)

iii

Serve with mixed-precision KV

The KV cache is its own sensitivity problem. optiq kv-cache measures it once per model; optiq serve serves with the resulting per-layer config behind an OpenAI-compatible API.

terminalbash

# 1-2 min, once per model $ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \ --target-bits 5.0 -o ./kv

# OpenAI + Anthropic compatible server on :8080 # /v1/chat/completions (OpenAI) # /v1/messages (Anthropic; works with Claude Code, anthropic SDK, etc.) $ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \ --kv-config ./kv/kv_config.json \ --port 8080

Where to next Each model family has a getting-started guide with model-specific sampling defaults and recommended use cases. Building an agent? Drop llms.txt into your IDE. It's the entire library reference in one Markdown file.

03 What it does

One sensitivity signal. A whole toolkit around it.

A single per-layer KL-divergence pass drives weight, KV-cache and LoRA-rank allocation. The rest of the toolkit (hot-swap adapters, multi-protocol serving with five tested client integrations, image input on the vision models, and the OptIQ Lab GUI for quantize, fine-tune, dataset, and chat workflows) sits around that core.

Mixed-precision weights

Per-layer KL on calibration data picks the bits. Sensitive layers stay high-precision, the rest go low, at the same average size as uniform-4.

Higher accuracy at the same disk size as uniform-4

Mixed-precision KV cache

A separate sensitivity pass on the KV cache. Layer 0 is often 56× more sensitive than average, so uniform 4-bit KV is catastrophic; mixed-precision is not.

Faster long-context decode without breaking quality

iii

LoRA, two ways

Fine-tune with adapter rank scaled by each layer's bits, then keep N adapters mounted on one base and...

Mlx-optiq: per-layer mixed-precision LLM quantization for Apple Silicon

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

It's Not Just X. It's Y