mlx-serve — Run any LLM on your Mac · MLX + GGUF · faster than LM Studio · OpenAI + Anthropic API
v26.7.1<br>Edit photos with words · talking-character video · ⌃Space launcher · your Ollama apps just work
Run any LLM<br>on your Mac
Your own private AI (chat, coding agents, images, video, and voice) running entirely on your Mac. Free and open source, faster than LM Studio, and your data never leaves the machine.
Download for Mac
View on GitHub
macOS 26+
M1 – M4
Free & open source
Deep dives
What will you do first?
Create<br>Edit photos with words
“Make the hair blue” — subject, pose & scene survive.
Deep dive →
Animate<br>Turn photos into video
Clips with synced sound — even talking characters.
Deep dive →
Speak<br>Clone any voice
Six seconds of audio, no transcript, all local.
Deep dive →
Code<br>Run Claude Code — free
Your coding agent, offline, no API key.
Deep dive →
Ask<br>Summon AI over any app
⌃Space launcher, voice, phone & schedules.
Deep dive →
Unleash<br>Sandbox your agent
Shell commands hit a Linux VM, not your Mac.
Deep dive →
Compare<br>Faster than LM Studio
+35% on identical models — keeps your library.
Deep dive →
Swap<br>Replace Ollama in one line
Your Ollama apps connect unchanged.
Deep dive →
Accelerate<br>Same answers, 2× faster
Speculative decoding, verified exact.
Deep dive →
Trust<br>Tool calls that don't break
Small-model mistakes repaired mid-flight.
Deep dive →
Any Open Model · Ollama Compatible · Photo Editing · Image-to-Video · Talking Characters · Style LoRAs · Agent Sandbox · ⌃Space Launcher · Krea-2 Image Gen · Voice Cloning · DiffusionGemma · DeepSeek V4 Flash · Gemma 4 · Qwen 3.6 · Llama 3 · Private by Design · Document RAG · OpenAI API · Anthropic API · Claude Code · MCP
Performance
Faster than LM Studio.<br>Every model.
Identical 4-bit MLX weights, same machine, same prompts. mlx-serve wins every cell — and speculative decoding pushes the lead further on the workloads where it counts. Full LM Studio comparison →
176
tok/s decode · Gemma 4 E2B 4-bit
3,749
tok/s prefill
284B
params, DeepSeek V4 Flash — local
0
accounts, subscriptions, or cloud calls
Decode = free-form generation · Echo = high-repetition (where PLD shines) · Code = code completion (where the drafter shines).<br>Tokens per second — how fast the AI writes; higher is better. Apple M4 Max (128 GB) · identical 4-bit MLX weights · ctx 4096 · temp 0 · LM Studio (MLX runtime) as baseline.
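For readers who want to reproduce a tokens-per-second figure on their own machine, here is a minimal sketch of the arithmetic. It assumes one common convention (prompt tokens divided by prefill time, generated tokens divided by decode time); the helper name and the timings are made up for illustration and are not mlx-serve measurements or part of its API.

```python
def tokens_per_second(token_count: int, seconds: float) -> float:
    """Throughput for one phase: tokens processed divided by wall-clock seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return token_count / seconds


# Illustrative timings only, not benchmark results.
prefill_tps = tokens_per_second(4096, 2.0)  # prompt tokens / time spent reading the prompt
decode_tps = tokens_per_second(256, 2.0)    # generated tokens / time spent writing the answer

print(f"prefill: {prefill_tps:.0f} tok/s, decode: {decode_tps:.0f} tok/s")
```

Prefill and decode are reported separately because reading a prompt is processed in parallel and is typically much faster per token than writing the answer one token at a time.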
The marquee capability
Run DeepSeek V4 Flash locally on your Mac
The 284-billion-parameter flagship — running on your own machine, no cloud, no API key. If you have a 96 GB+ Apple Silicon Mac, it's one click away in the Model Browser.
Built on Salvatore Sanfilippo's antirez/ds4 engine, with native Metal kernels validated byte-for-byte against the reference forward pass.
One-click download, served from the same model picker as everything else.
Agent mode and MCP tool calling work on DSV4 too — the full toolset is inlined into the prompt.
A single self-contained binary — kernel sources are embedded and staged at first launch.
Get MLX Core
284B
parameters · running on your desk
96 GB+<br>unified memory
0<br>cloud calls
1<br>binary
Speculative Decoding
Two ways to draft ahead
Generate multiple tokens per forward pass, verified exactly — so output is identical, just faster. Works on every API surface, streaming or not, tools included: agent loops that echo file contents into edits decode at ~2×. Smart gates keep it on where it pays and step aside where it doesn't.
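Why "verified exactly" means identical output can be seen in a minimal sketch of the verification step under greedy (temp 0) decoding. This is an illustration, not mlx-serve's code: `verify_draft`, `target_next`, and `toy_target` are hypothetical names, and a real engine scores every draft position in one batched forward pass rather than calling the model once per token as done here for clarity.

```python
from typing import Callable, List, Sequence


def verify_draft(
    context: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted by one verification step.

    Draft tokens are kept only while they match what the target model
    would have produced itself; the first mismatch is replaced by the
    target's own token, so the result equals plain greedy decoding.
    """
    ctx = list(context)
    emitted: List[int] = []
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            emitted.append(expected)  # correction token from the target
            return emitted
        emitted.append(tok)
        ctx.append(tok)
    # Every draft token matched: the same pass also yields one extra token.
    emitted.append(target_next(ctx))
    return emitted


def toy_target(ctx: Sequence[int]) -> int:
    """Stand-in 'model' for the demo: next token is (last + 1) mod 10."""
    return (ctx[-1] + 1) % 10


print(verify_draft([1, 2, 3], [4, 5, 9], toy_target))  # draft goes wrong at the third token
print(verify_draft([1, 2, 3], [4, 5, 6], toy_target))  # draft fully accepted, plus one bonus token
```

A wrong draft costs some wasted compute but never changes the text, which is why the speedup comes with no quality cost.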
PLD<br>Prompt Lookup Decoding
Model-agnostic n-gram drafting from the prompt + generated text. Works on every architecture — Gemma, Qwen, Llama, Mistral, Nemotron-H, LFM2.5 — with nothing extra to download.
up to 2× on agent tool loops, echo & RAG
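The n-gram drafting idea behind PLD can be sketched in a few lines: look at the last few tokens, find where the same run appeared earlier in the prompt or generated text, and propose whatever followed it. This is a simplified illustration of the general technique, not mlx-serve's implementation; the function name and the default `ngram` and `max_draft` values are assumptions.

```python
from typing import List, Sequence


def pld_draft(tokens: Sequence[int], ngram: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier text.

    Returns the tokens that followed the most recent earlier occurrence
    of the last `ngram` tokens, or an empty list when there is no match.
    """
    n = len(tokens)
    if n <= ngram:
        return []
    tail = list(tokens[n - ngram:])
    # Scan backwards so the most recent match wins; skip the tail itself.
    for start in range(n - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            follow = start + ngram
            return list(tokens[follow:follow + max_draft])
    return []


# The run 5, 6, 7 appeared earlier, followed by 8, 9, so those become the draft.
history = [5, 6, 7, 8, 9, 1, 5, 6, 7]
print(pld_draft(history, ngram=3, max_draft=2))
```

Because the draft comes from text already in the context, no second model is needed, which fits workloads that repeat their input such as file edits, echo, and RAG.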
DRAFTER<br>Gemma 4 assistant drafter
A tiny cross-attention drafter reuses the target model's own K/V cache to propose blocks of tokens. Tuned block sizes per target (E2B → 31B).
up to +30% on Gemma 4 code completion
ADAPTIVE<br>Gates that know when to quit
A prompt-time repetition score disables drafting on novel content; a runtime acceptance gate backs off mid-decode when drafts stop landing. You never pay for speculation that won't pay back.
exact output, zero quality cost
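The two gates described above can be sketched as follows. Everything here is an assumption made for illustration: the repetition score is taken to be the fraction of repeated n-grams in the prompt, the runtime gate is a moving average of the draft acceptance rate, and the thresholds are arbitrary. The page does not document mlx-serve's actual scoring or cutoffs.

```python
from typing import Sequence


def repetition_score(tokens: Sequence[int], ngram: int = 3) -> float:
    """Fraction of n-grams that already appeared earlier in the sequence."""
    total = len(tokens) - ngram + 1
    if total <= 0:
        return 0.0
    seen = set()
    repeats = 0
    for i in range(total):
        gram = tuple(tokens[i:i + ngram])
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / total


class AcceptanceGate:
    """Runtime gate: stop drafting when too few draft tokens are accepted."""

    def __init__(self, min_rate: float = 0.3, alpha: float = 0.5) -> None:
        self.min_rate = min_rate  # hypothetical cutoff
        self.alpha = alpha        # weight given to the newest observation
        self.rate = 1.0           # start optimistic

    def update(self, accepted: int, drafted: int) -> None:
        """Fold one verification step into the moving-average acceptance rate."""
        if drafted == 0:
            return
        observed = accepted / drafted
        self.rate = self.alpha * observed + (1 - self.alpha) * self.rate

    @property
    def enabled(self) -> bool:
        return self.rate >= self.min_rate


gate = AcceptanceGate()
gate.update(accepted=0, drafted=4)  # one bad step: rate drops but stays above the cutoff
gate.update(accepted=0, drafted=4)  # a second bad step pushes it below the cutoff
print(repetition_score([1, 2, 3, 1, 2, 3, 1, 2, 3]), gate.enabled)
```

The prompt-time score is cheap to compute once per request, while the runtime gate reacts to what actually happens during decoding, so the two cover different failure cases.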
Qwen 3.6 native MTP. Models with a trained MTP sidecar (like ddalcu/Qwen3.6-27B-4bit-MTP-MLX-Serve) auto-load it and speculate from the model's own head — up to 1.8× on agent-style edit loops (29 → 51.6 tok/s on Qwen3.6-27B 4-bit, M4 Max), 1.43× on code. The controller watches its own acceptance rate per request and adapts draft depth on the fly. Zero setup — drop in the model and it's on. Speculative decoding, in depth →
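As a worked check of the figures quoted above, and a rough picture of what adapting draft depth could look like, here is a small sketch. The speedup is plain division of the two quoted throughputs. The expected-tokens formula assumes every draft token is accepted independently with the same probability, which is a simplification. The `next_depth` rule and its thresholds are hypothetical and are not mlx-serve's controller.

```python
def speedup(baseline_tps: float, speculative_tps: float) -> float:
    """Ratio of speculative to baseline throughput."""
    return speculative_tps / baseline_tps


def expected_tokens_per_step(accept_prob: float, depth: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each of `depth` draft tokens is accepted independently with
    probability `accept_prob`; equals 1 + a + a^2 + ... + a^depth.
    """
    if accept_prob >= 1.0:
        return float(depth + 1)
    return (1 - accept_prob ** (depth + 1)) / (1 - accept_prob)


def next_depth(depth: int, accepted: int, drafted: int,
               lo: float = 0.4, hi: float = 0.8,
               min_depth: int = 1, max_depth: int = 8) -> int:
    """Hypothetical rule: draft deeper when drafts land, shallower when they miss."""
    if drafted == 0:
        return depth
    rate = accepted / drafted
    if rate >= hi:
        return min(depth + 1, max_depth)
    if rate < lo:
        return max(depth - 1, min_depth)
    return depth


# 29 tok/s to 51.6 tok/s, the figures quoted for Qwen3.6-27B 4-bit on M4 Max.
print(round(speedup(29, 51.6), 2))
print(expected_tokens_per_step(0.5, 2))
print(next_depth(4, accepted=4, drafted=4), next_depth(4, accepted=1, drafted=4))
```

The formula also shows why depth has to adapt: when the acceptance probability is low, extra draft tokens add almost nothing to the expected yield while still costing compute.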
Features
A complete local-AI stack
Everything a private AI setup needs in one Mac app — plus the deep dives above when you...