Show HN: Mlx-serve – LLM inference server for Apple Silicon, written in Zig

ddalcu | 1 point | 0 comments

mlx-serve — Run any LLM on your Mac · MLX + GGUF · faster than LM Studio · OpenAI + Anthropic API

v26.7.1: Edit photos with words · talking-character video · ⌃Space launcher · your Ollama apps just work

Run any LLM on your Mac

Your own private AI (chat, coding agents, images, video, and voice) running entirely on your Mac. Free and open source, faster than LM Studio, and your data never leaves the machine.

Download for Mac

View on GitHub

macOS 26+

M1 – M4

Free & open source

Deep dives

What will you do first?

Create: Edit photos with words
“Make the hair blue”: subject, pose & scene survive.

Animate: Turn photos into video
Clips with synced sound, even talking characters.

Speak: Clone any voice
Six seconds of audio, no transcript, all local.

Code: Run Claude Code, free
Your coding agent, offline, no API key.

Ask: Summon AI over any app
⌃Space launcher, voice, phone & schedules.

Unleash: Sandbox your agent
Shell commands hit a Linux VM, not your Mac.

Compare: Faster than LM Studio
+35% on identical models; keeps your library.

Swap: Replace Ollama in one line
Your Ollama apps connect unchanged.

Accelerate: Same answers, 2× faster
Speculative decoding, verified exact.

Trust: Tool calls that don't break
Small-model mistakes repaired mid-flight.

Any Open Model · Ollama Compatible · Photo Editing · Image-to-Video · Talking Characters · Style LoRAs · Agent Sandbox · ⌃Space Launcher · Krea-2 Image Gen · Voice Cloning · DiffusionGemma · DeepSeek V4 Flash · Gemma 4 · Qwen 3.6 · Llama 3 · Private by Design · Document RAG · OpenAI API · Anthropic API · Claude Code · MCP

Performance

Faster than LM Studio. Every model.

Identical 4-bit MLX weights, same machine, same prompts. mlx-serve wins every cell — and speculative decoding pushes the lead further on the workloads where it counts. Full LM Studio comparison →

176 tok/s decode · Gemma 4 E2B 4-bit

3,749 tok/s prefill

284B params · DeepSeek V4 Flash, local

No accounts, subscriptions, or cloud calls

Decode = free-form generation · Echo = high-repetition (where PLD shines) · Code = code completion (where the drafter shines).

Tokens per second measures how fast the AI writes; higher is better. Apple M4 Max (128 GB) · identical 4-bit MLX weights · ctx 4096 · temp 0 · LM Studio (MLX runtime) as baseline.

The marquee capability

Run DeepSeek V4 Flash locally on your Mac

The 284-billion-parameter flagship — running on your own machine, no cloud, no API key. If you have a 96 GB+ Apple Silicon Mac, it's one click away in the Model Browser.

Built on Salvatore Sanfilippo's antirez/ds4 engine — native Metal kernels, byte-validated against the reference forward.

One-click download, served from the same model picker as everything else.

Agent mode and MCP tool calling work on DSV4 too — the full toolset is inlined into the prompt.

A single self-contained binary — kernel sources are embedded and staged at first launch.

Get MLX Core

284B parameters · running on your desk

96 GB+ unified memory

No cloud calls

A single binary

Speculative Decoding

Two ways to draft ahead

Generate multiple tokens per forward pass, verified exactly — so output is identical, just faster. Works on every API surface, streaming or not, tools included: agent loops that echo file contents into edits decode at ~2×. Smart gates keep it on where it pays and step aside where it doesn't.
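As a rough illustration of why exact verification leaves the output unchanged, here is a minimal Python sketch of greedy (temp 0) draft verification. It is not mlx-serve's code: `verify_draft`, `toy_target`, and the token values are hypothetical names invented for this example, and the toy "model" simply counts upward modulo 10.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: given the context and a draft, return the target
# model's greedy token at every draft position plus one extra position.
TargetFn = Callable[[Sequence[int], Sequence[int]], List[int]]


def verify_draft(context: Sequence[int], draft: Sequence[int], target: TargetFn) -> List[int]:
    """Return the tokens to append: the accepted draft prefix plus one target token."""
    # predictions[i] is the target's greedy choice after context + draft[:i],
    # so it has len(draft) + 1 entries and comes from a single "forward pass".
    predictions = target(context, draft)
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if predictions[i] != tok:
            break  # first disagreement: discard the rest of the draft
        accepted.append(tok)
    # The target's own token at the stopping point is always correct, so keep it.
    accepted.append(predictions[len(accepted)])
    return accepted


def toy_target(context: Sequence[int], draft: Sequence[int]) -> List[int]:
    """Stand-in model (assumption): the next token is (last token + 1) mod 10."""
    out = []
    for i in range(len(draft) + 1):
        seq = list(context) + list(draft[:i])
        out.append((seq[-1] + 1) % 10)
    return out


if __name__ == "__main__":
    # Draft agrees on 4 and 5, then diverges; the target supplies 6 instead of 9.
    print(verify_draft([1, 2, 3], [4, 5, 9, 7], toy_target))  # [4, 5, 6]
```

Every emitted token is one the target would have chosen anyway, which is the sense in which the result matches plain greedy decoding; a good draft only changes how many tokens each pass yields.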

PLD: Prompt Lookup Decoding

Model-agnostic n-gram drafting from the prompt + generated text. Works on every architecture — Gemma, Qwen, Llama, Mistral, Nemotron-H, LFM2.5 — with nothing extra to download.

up to 2× on agent tool loops, echo & RAG
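The n-gram drafting idea behind PLD can be sketched in a few lines of Python. This is a generic illustration of prompt lookup, not mlx-serve's implementation; the function name `pld_draft` and the `ngram` and `max_draft` defaults are assumptions chosen for the example.

```python
from typing import List, Sequence


def pld_draft(tokens: Sequence[int], ngram: int = 3, max_draft: int = 5) -> List[int]:
    """Propose draft tokens by locating the trailing n-gram earlier in `tokens`.

    `tokens` is the prompt plus everything generated so far.
    """
    if len(tokens) <= ngram:
        return []
    tail = list(tokens[-ngram:])
    # Scan right to left for the most recent earlier occurrence of the tail,
    # starting one position before the tail itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            # Whatever followed that occurrence becomes the draft.
            return list(tokens[start + ngram:start + ngram + max_draft])
    return []  # nothing repeated: no draft, fall back to normal decoding


if __name__ == "__main__":
    history = [10, 11, 12, 13, 14, 15, 20, 11, 12, 13]
    # The tail [11, 12, 13] appeared earlier, followed by 14, 15, 20.
    print(pld_draft(history, ngram=3, max_draft=3))  # [14, 15, 20]
```

Because the draft comes from a lookup over existing text rather than a second model, there is nothing extra to load, and it pays off most when the output repeats the input, as in echo-heavy edit loops.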

DRAFTER: Gemma 4 assistant drafter

A tiny cross-attention drafter reuses the target model's own K/V cache to propose blocks of tokens. Tuned block sizes per target (E2B → 31B).

up to +30% on Gemma 4 code completion

ADAPTIVE: Gates that know when to quit

A prompt-time repetition score disables drafting on novel content; a runtime acceptance gate backs off mid-decode when drafts stop landing. You never pay for speculation that won't pay back.

exact output, zero quality cost
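One plausible shape for the two gates, sketched in Python under stated assumptions: the page does not say how the repetition score or the acceptance gate are computed, so the repeated-n-gram fraction, the moving average, and every threshold and name here (`repetition_score`, `AcceptanceGate`, `threshold`, `alpha`) are hypothetical.

```python
from typing import Sequence


def repetition_score(tokens: Sequence[int], ngram: int = 3) -> float:
    """Fraction of n-grams in `tokens` that already appeared earlier (assumed metric)."""
    total = len(tokens) - ngram + 1
    if total <= 0:
        return 0.0
    seen = set()
    repeats = 0
    for i in range(total):
        gram = tuple(tokens[i:i + ngram])
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / total


class AcceptanceGate:
    """Runtime gate: tracks a moving average of the draft acceptance rate."""

    def __init__(self, threshold: float = 0.3, alpha: float = 0.5) -> None:
        self.threshold = threshold  # hypothetical cut-off
        self.alpha = alpha          # weight given to the newest observation
        self.rate = 1.0             # start optimistic so drafting gets a chance

    def update(self, accepted: int, drafted: int) -> None:
        if drafted == 0:
            return
        observed = accepted / drafted
        self.rate = (1 - self.alpha) * self.rate + self.alpha * observed

    def should_draft(self) -> bool:
        return self.rate >= self.threshold


if __name__ == "__main__":
    print(repetition_score([1, 2, 3, 1, 2, 3, 1, 2, 3]))  # 4 of 7 trigrams repeat
    gate = AcceptanceGate()
    gate.update(accepted=0, drafted=4)
    gate.update(accepted=0, drafted=4)
    print(gate.should_draft())  # False after two fully rejected drafts
```

A real gate would also need a policy for probing again after it switches drafting off, which this sketch leaves out.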

Qwen 3.6 native MTP. Models with a trained MTP sidecar (like ddalcu/Qwen3.6-27B-4bit-MTP-MLX-Serve) auto-load it and speculate from the model's own head — up to 1.8× on agent-style edit loops (29 → 51.6 tok/s on Qwen3.6-27B 4-bit, M4 Max), 1.43× on code. The controller watches its own acceptance rate per request and adapts draft depth on the fly. Zero setup — drop in the model and it's on. Speculative decoding, in depth →
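The quoted throughput figures can be checked with simple arithmetic. The short Python sketch below uses only the two numbers stated above (29 and 51.6 tok/s); the 1,000-token response length and the helper names are assumptions picked for illustration.

```python
def speedup(baseline_tps: float, new_tps: float) -> float:
    """Ratio of new to baseline throughput, both in tokens per second."""
    return new_tps / baseline_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Wall-clock decode time for `tokens` at a steady rate of `tps`."""
    return tokens / tps


if __name__ == "__main__":
    base, mtp = 29.0, 51.6  # tok/s figures quoted for Qwen3.6-27B 4-bit on M4 Max
    print(round(speedup(base, mtp), 2))        # 1.78, consistent with "up to 1.8×"
    print(round(seconds_for(1000, base), 1))   # 34.5 seconds without MTP
    print(round(seconds_for(1000, mtp), 1))    # 19.4 seconds with MTP
```

For a hypothetical 1,000-token reply, that is roughly 15 seconds saved per response on the edit-loop workload.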

Features

A complete local-AI stack

Everything a private AI setup needs in one Mac app — plus the deep dives above when you...
