OMLX v0.3.9 Stable Merges Native MTP (Multi-Token Prediction)


Release v0.3.9 · jundot/omlx · GitHub



jundot released this 21 May 16:42 · 1 commit to main since this release · v0.3.9 · 8cad121

This is the stable 0.3.9 release, consolidating the 0.3.9.dev1, 0.3.9.dev2, and 0.3.9rc1 pre-releases plus the post-rc stabilization fixes. Huge thanks to everyone who filed issues and sent PRs since 0.3.8. If you hit a bug, please open an issue.

Highlights

Native MTP (Multi-Token Prediction) for Qwen3.5 / 3.6, Gemma 4, and DeepSeek-V4

Turn it on per model in admin settings, and supported models predict multiple tokens at once for faster decode. It is off by default. Gemma 4 also gets MTP on the vision path, so image + text requests decode noticeably faster too.

Source PRs: ml-explore/mlx-lm#990 (Qwen3.5 / 3.6, @AirRunner), Blaizzy/mlx-lm#15 (DeepSeek-V4, @0xClandestine), and @Blaizzy's mlx-vlm for Gemma 4. oQ preserves mtp.* weights via a -mtp suffix on quantized output dirs; pre-converted oQ MTP models are at huggingface.co/Jundot.
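To make the "predict multiple tokens at once" idea concrete, here is a toy sketch of one common way multi-token prediction speeds up decode: extra heads propose several tokens, and the main model keeps only the prefix it agrees with. This is an assumption about the general technique, not oMLX's or mlx-lm's actual code. The names `mtp_decode`, `toy_next`, and `toy_draft` are hypothetical, and the "models" are plain arithmetic functions.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]
Draft = Callable[[List[int], int], List[int]]


def greedy_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: exactly one new token per decode step."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens


def mtp_decode(next_token: NextToken, draft: Draft, prompt: List[int],
               n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Decode with up to k proposed tokens per step; returns (tokens, steps)."""
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    steps = 0
    while len(tokens) < target_len:
        steps += 1
        for guess in draft(tokens, k):
            # A real engine would verify all k guesses in one batched forward
            # pass; this toy calls the main model once per guess instead.
            expected = next_token(tokens)
            tokens.append(expected)  # always keep the main model's token
            if guess != expected or len(tokens) == target_len:
                break  # first disagreement ends the step
    return tokens, steps


def toy_next(tokens: List[int]) -> int:
    """Hypothetical stand-in for the main model (deterministic arithmetic)."""
    return (31 * sum(tokens) + len(tokens)) % 50


def toy_draft(tokens: List[int], k: int) -> List[int]:
    """Hypothetical MTP heads: usually right, deliberately wrong sometimes."""
    ctx, out = list(tokens), []
    for _ in range(k):
        guess = toy_next(ctx)
        if len(ctx) % 4 == 0:
            guess = (guess + 1) % 50  # inject a wrong guess
        out.append(guess)
        ctx.append(guess)
    return out


prompt = [1, 2, 3]
baseline = greedy_decode(toy_next, prompt, 12)
fast, steps = mtp_decode(toy_next, toy_draft, prompt, 12)
```

In this sketch the output is identical to plain greedy decoding because every kept token comes from the main model; the gain is that `steps` is smaller than the number of generated tokens whenever some guesses are accepted.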

DeepSeek V4 Pro / Flash support, including SSD cache

Full V4 model + PoolingCache / BatchPoolingCache ported from ml-explore/mlx-lm#1192 by @Blaizzy, tested against mlx-community/deepseek-v4. Highlights:

F8_E8M0 / fp8 quant branch wired into mlx_lm.utils.load_model.

SSD + prefix cache for V4: the cache type interface was generalized from a 2-tuple (keys, values) to an N-tuple state (PoolingCache.state is (buf_kv, buf_gate, pooled)), with a new on-disk format, paged_ssd_cache v3. Without this, V4 sessions were silently corrupted across prefix-cache hits.

V4 tool calling end-to-end: DSML-format parsing + emission on OpenAI / Anthropic endpoints, so V4 Pro / Flash drives Claude Code, Codex, and OpenClaw with no extra config.
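The move from a fixed (keys, values) pair to an N-tuple cache state, described in the SSD + prefix cache item above, can be illustrated with a small serializer that stores however many buffers a cache exposes. This is a minimal sketch under stated assumptions: the byte layout, the `SKCH` magic, and the function names are hypothetical and are not the real paged_ssd_cache v3 format.

```python
import struct
from typing import Tuple

MAGIC = b"SKCH"  # hypothetical marker, not the real on-disk header


def pack_state(state: Tuple[bytes, ...]) -> bytes:
    """Serialize an N-tuple of raw buffers: magic, count, then (size, data) pairs."""
    header = MAGIC + struct.pack("<I", len(state))
    body = b"".join(struct.pack("<Q", len(part)) + part for part in state)
    return header + body


def unpack_state(blob: bytes) -> Tuple[bytes, ...]:
    """Inverse of pack_state; works for any number of buffers."""
    if blob[:4] != MAGIC:
        raise ValueError("not a sketch cache block")
    (count,) = struct.unpack_from("<I", blob, 4)
    offset = 8
    parts = []
    for _ in range(count):
        (size,) = struct.unpack_from("<Q", blob, offset)
        offset += 8
        parts.append(blob[offset:offset + size])
        offset += size
    return tuple(parts)


# A classic KV cache has two buffers; a pooling-style cache has three.
kv_state = (b"keys", b"values")
pooling_state = (b"buf_kv", b"buf_gate", b"pooled")
```

A reader that hard-codes two buffers would drop the third buffer of a pooling-style state, which is the kind of silent loss a count-prefixed layout avoids.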

DFlash now supports Gemma 4

Gemma 4 runs on the DFlash engine (thanks @bstnxbt's dflash-mlx), so the model lineup matches the rest of the pool. The admin quantization picker lights up every DFlash option including an FP16 draft-model boost (#880, thanks @deepsweet), with a configurable prefix cache size (#1120, thanks @yilmazorhan) and draft_window_size / draft_sink_size / verify_mode model settings (#1276).

Chunked prefill (#1224)

A long-context prompt no longer blocks decode for other in-flight requests: prefill advances one chunk per scheduler step, so concurrent requests keep streaming tokens through it. Off by default, toggleable from admin. Thanks @drumtorben.
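The scheduling idea can be sketched as a toy loop in which a request still in prefill consumes one chunk per step while requests already decoding each emit a token in that same step. This is an illustrative assumption, not oMLX's scheduler; `Request`, `scheduler_step`, and the chunk size are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    name: str
    prompt_len: int
    prefilled: int = 0   # prompt tokens processed so far
    generated: int = 0   # output tokens emitted so far

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len


def scheduler_step(requests: List[Request], chunk_size: int) -> List[str]:
    """Advance every request by one unit of work and report what happened."""
    events = []
    for req in requests:
        if req.in_prefill:
            # Process at most one chunk of the prompt, then yield.
            advance = min(chunk_size, req.prompt_len - req.prefilled)
            req.prefilled += advance
            events.append(f"{req.name}:prefill+{advance}")
        else:
            # Requests past prefill keep streaming one token per step.
            req.generated += 1
            events.append(f"{req.name}:decode")
    return events


long_req = Request("long", prompt_len=10_000)
chat_req = Request("chat", prompt_len=4, prefilled=4)  # already decoding

log = [scheduler_step([long_req, chat_req], chunk_size=2048) for _ in range(3)]
```

After three steps in this toy run, the long prompt is only partly prefilled, yet the chat request has produced a token on every step instead of waiting for the whole prompt.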

Major stability improvements on low-memory Macs

oMLX is far more resilient on tight-memory machines. A new memory enforcer measures the same phys_footprint metric the OS uses for jetsam decisions and applies prefill admission control, so the server declines work before it would be killed instead of crashing under pressure. Backed by a hot-cache eviction race fix (#1298), parallelized SSD↔hot block preloading (#1301), per-model cache hit-rate visibility (#1183, all thanks @ivaniguarans), and a real-time memory bar on the admin dashboard (#1278, thanks @beamivalice). oQ can also auto-build a proxy model when the source can't fit in RAM, so large checkpoints are quantizable on smaller boxes (#1136).
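Prefill admission control can be pictured as a budget check made before work is accepted. The sketch below is a simplified assumption of that idea: the function name, the headroom fraction, and the way the prefill cost is estimated are all hypothetical, and the footprint is passed in as a plain number rather than read from the OS.

```python
GIB = 1024 ** 3


def admit_prefill(current_footprint: int, estimated_prefill: int,
                  limit: int, headroom: float = 0.10) -> bool:
    """Return True if the prefill fits under the limit with headroom to spare.

    current_footprint: measured process footprint in bytes
    estimated_prefill: projected extra bytes the prefill would need
    limit:             hard memory ceiling in bytes
    headroom:          fraction of the limit kept free as a safety margin
    """
    budget = limit * (1.0 - headroom)
    return current_footprint + estimated_prefill <= budget


# With a 16 GiB ceiling and 10% headroom, the budget is 14.4 GiB.
small_ok = admit_prefill(10 * GIB, 3 * GIB, 16 * GIB)   # 13 GiB fits
large_ok = admit_prefill(10 * GIB, 5 * GIB, 16 * GIB)   # 15 GiB does not
```

Declining the second request up front is the behaviour the release notes describe: the server refuses work it cannot afford rather than being killed partway through it.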

ParoQuant support

Adds ParoQuant plus a pluggable custom-quantization loader so additional quant methods plug in without forking the loader path; all load call sites route through the dispatcher (#209, thanks @liang2kl).
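A pluggable loader of this kind is often built as a registry keyed by quantization method, with one dispatcher that every call site goes through. The following is a minimal sketch of that pattern, not the code from #209; `register_loader`, `load_model`, the `quant_method` key, and the `"toy"` method are hypothetical.

```python
from typing import Any, Callable, Dict

Loader = Callable[[Dict[str, Any]], str]

_LOADERS: Dict[str, Loader] = {}


def register_loader(method: str) -> Callable[[Loader], Loader]:
    """Decorator that registers a loader for one quantization method."""
    def decorator(fn: Loader) -> Loader:
        _LOADERS[method] = fn
        return fn
    return decorator


def load_model(config: Dict[str, Any]) -> str:
    """Single entry point: pick the loader named by the model config."""
    method = config.get("quant_method", "default")
    try:
        loader = _LOADERS[method]
    except KeyError:
        raise ValueError(f"no loader registered for {method!r}") from None
    return loader(config)


@register_loader("default")
def _load_default(config: Dict[str, Any]) -> str:
    return f"default loader for {config['path']}"


@register_loader("toy")  # a new method plugs in without touching load_model
def _load_toy(config: Dict[str, Any]) -> str:
    return f"toy-quant loader for {config['path']}"


result = load_model({"path": "models/example", "quant_method": "toy"})
```

Adding another quantization method then means registering one more function, while existing call sites keep calling `load_model` unchanged.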

New Features

One-command coding agents: omlx launch wires env + model and execs into the agent via a curses TUI picker (#998 @fparrav, #1085 @scaryrawr, #1250 @shannonsands).

Chat multi-tasking: run multiple admin chats in parallel (#1231, @beamivalice).

Admin "Restart Server" button, admin-auth gated (#1194, @jasonpaulso).

Native reasoning in the Responses API survives tool-call round-trips (#1245,...
