Run LLMs locally on your Mac (Apple Silicon) · mlx-optiq
Optimizing compiler · MLX
Quantize, fine-tune
and serve LLMs
entirely on Apple Silicon.
Run large language models locally on your Mac, from M1 to M5.<br>Per-layer sensitivity analysis for mixed-precision weights.<br>LoRA fine-tuning that respects the bit budget.<br>A server that speaks both OpenAI and Anthropic APIs (point Claude Code at your<br>local quant). Send it an image, not just text, on any vision-capable model.<br>No GPU cluster, no API key.
Per-layer bit allocation · sample LLM
Per-layer bit allocation across a 32-layer transformer: tall emerald bars are 8-bit protected layers, short warm-grey bars are 4-bit.
8-bit · sensitive layers<br>4-bit · robust layers
$ pip install mlx-optiq<br>copy
3.1×<br>avg compression vs bf16
+1.4×<br>decode via MTP / drafter
+13.6<br>best Capability Score gain
140k+<br>HF downloads / month
OptIQ Lab — your Mac, your models, no cloud
OptIQ Lab · localhost:8080
A local LLM workbench, one pip install away: chat with sandboxed tools, compare models, fine-tune, and serve. Here a 4-bit OptIQ Qwen3.5 answers at 90 tok/s, fully offline .
01 Pre-built models
Drop-in 4-bit quants. Same weights, smarter bits.
Sixteen production mlx-optiq-quantized LLMs on Hugging Face. Nemotron 3, MiniCPM5, Qwen3.5, Qwen3.6 and Gemma-4 families, from 1 B dense to 35 B-A3B mixture-of-experts. They load directly into stock mlx-lm. No special runtime.
Gemma-4 · new<br>gemma-4-12B-it-OptiQ-4bit
Google's unified text+vision Gemma-4, at 8.3 GB, with image input. Capability Score 68.2 (+6.4 vs uniform-4-bit), one of our largest mixed-precision gains, and the strongest model we ship under 9 GB on disk.
8.3 GB on disk<br>68.2 Capability<br>+6.4 vs U4
Gemma-4<br>gemma-4-31B-it-OptiQ-4bit
The largest single quant we ship. 31 B parameters in 20.8 GB with Capability Score 79.7 (+3.5 vs uniform-4-bit). Pair with the matching -assistant-bf16 drafter for speculative decoding.
20.8 GB on disk<br>79.7 Capability<br>+3.5 vs U4
Qwen3.6<br>Qwen3.6-27B-OptiQ-4bit
Frontier-class reasoning at 17.5 GB with our highest Capability Score (83.0). Bundled MTP head gives ~1.4× decode via optiq serve --mtp.
17.5 GB on disk<br>83.0 Capability<br>+0.5 vs U4
Qwen3.5<br>Qwen3.5-9B-OptiQ-4bit
The default daily-driver. 9 B parameters in 6.6 GB. Capability Score 66.8 (+0.2 vs uniform-4-bit). Long context to 64 k via mixed-precision KV; bundled MTP head for speculative decoding.
6.6 GB on disk<br>66.8 Capability<br>+0.2 vs U4
all 16 models →
02 Quickstart
From zero to a serving LLM in three commands.
Each step is reversible and works with stock MLX tools. mlx-optiq is additive. Skip any of these and you still have a working pipeline.
Install
Pure Python. Pulls in mlx, mlx-lm and huggingface-hub. Python 3.11+ on Apple Silicon.
terminalbash
$ pip install mlx-optiq
ii
Use a pre-built quant
Pre-built mlx-optiq quants load with stock mlx-lm. Per-layer bit assignment is recorded in the model metadata. No special loader required.
generate.pypython
from mlx_lm import load, generate
model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")<br>out = generate(model, tok, prompt="Explain mixed-precision quantization.", max_tokens=200)<br>print(out)
iii
Serve with mixed-precision KV
The KV cache is its own sensitivity problem. optiq kv-cache measures it once per model; optiq serve serves with the resulting per-layer config behind an OpenAI-compatible API.
terminalbash
# 1-2 min, once per model<br>$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \<br>--target-bits 5.0 -o ./kv
# OpenAI + Anthropic compatible server on :8080<br># /v1/chat/completions (OpenAI)<br># /v1/messages (Anthropic; works with Claude Code, anthropic SDK, etc.)<br>$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \<br>--kv-config ./kv/kv_config.json \<br>--port 8080
Where to next<br>Each model family has a getting-started guide with model-specific sampling defaults and recommended use cases.<br>Building an agent? Drop llms.txt into your IDE. It's the entire library reference in one Markdown file.
03 What it does
One sensitivity signal. A whole toolkit around it.
A single per-layer KL-divergence pass drives weight, KV-cache and LoRA-rank allocation. The rest of the toolkit (hot-swap adapters, multi-protocol serving with five tested client integrations, image input on the vision models, and the OptIQ Lab GUI for quantize, fine-tune, dataset, and chat workflows) sits around that core.
Mixed-precision weights
Per-layer KL on calibration data picks the bits. Sensitive layers stay high-precision, the rest go low, at the same average size as uniform-4.
Higher accuracy at the same disk size as uniform-4
ii
Mixed-precision KV cache
A separate sensitivity pass on the KV cache. Layer 0 is often 56× more sensitive than average, so uniform 4-bit KV is catastrophic; mixed-precision is not.
Faster long-context decode without breaking quality
iii
LoRA, two ways
Fine-tune with adapter rank scaled by each layer's bits, then keep N adapters mounted on one base and...