Release v0.3.9 · jundot/omlx · GitHub
v0.3.9
Latest
jundot released this 21 May 16:42 · 1 commit to main since this release · v0.3.9 · 8cad121
This is the stable 0.3.9 release, consolidating the 0.3.9.dev1, 0.3.9.dev2, and 0.3.9rc1 pre-releases plus the post-rc stabilization fixes. Huge thanks to everyone who filed issues and sent PRs since 0.3.8. If you hit a bug, please open an issue.
Highlights
Native MTP (Multi-Token Prediction) for Qwen3.5 / 3.6, Gemma 4, and DeepSeek-V4
Enable it per model in the admin settings, and supported models predict multiple tokens per step for faster decode. It is off by default. Gemma 4 also gets MTP on the vision path, so image + text requests decode noticeably faster too.
Source PRs: ml-explore/mlx-lm#990 (Qwen3.5 / 3.6, @AirRunner), Blaizzy/mlx-lm#15 (DeepSeek-V4, @0xClandestine), and @Blaizzy's mlx-vlm for Gemma 4. oQ preserves mtp.* weights via a -mtp suffix on quantized output dirs; pre-converted oQ MTP models are at huggingface.co/Jundot.
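As a rough illustration of why predicting several tokens at once can speed up decode, here is a generic greedy draft-and-verify acceptance rule of the kind commonly used with multi-token prediction. It is a sketch under stated assumptions, not the oMLX or mlx-lm implementation: the function name and the idea that the main model supplies one verified token per drafted position (plus one extra) are assumptions made for the example.

```python
from typing import List


def accept_drafted_tokens(drafted: List[int], verified: List[int]) -> List[int]:
    """Hypothetical greedy acceptance rule for drafted tokens.

    `drafted` holds the extra tokens proposed in one step.
    `verified` holds the main model's own choice at each drafted position,
    plus one more token for the position after the last draft, so it must
    contain len(drafted) + 1 entries.
    """
    if len(verified) != len(drafted) + 1:
        raise ValueError("verified must have exactly one more entry than drafted")

    accepted: List[int] = []
    for position, token in enumerate(drafted):
        if token != verified[position]:
            # First mismatch: keep the main model's token and stop.
            accepted.append(verified[position])
            return accepted
        accepted.append(token)

    # Every draft matched, so the extra verified token is also usable.
    accepted.append(verified[-1])
    return accepted
```

Under this rule each step yields at least one token and at most len(drafted) + 1, which is where the decode speedup would come from when drafts are usually right.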
DeepSeek V4 Pro / Flash support, including SSD cache
Full V4 model + PoolingCache / BatchPoolingCache ported from ml-explore/mlx-lm#1192 by @Blaizzy, tested against mlx-community/deepseek-v4. Highlights:
F8_E8M0 / fp8 quant branch wired into mlx_lm.utils.load_model.
SSD + prefix cache for V4: the cache type interface was generalized from a 2-tuple (keys, values) to an N-tuple state (PoolingCache.state is (buf_kv, buf_gate, pooled)), with a new on-disk format, paged_ssd_cache v3. Without this, V4 sessions were silently corrupted across prefix-cache hits.
V4 tool calling end-to-end: DSML-format parsing + emission on OpenAI / Anthropic endpoints, so V4 Pro / Flash drives Claude Code, Codex, and OpenClaw with no extra config.
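The move from a fixed (keys, values) pair to an N-tuple state can be sketched as a serializer that records how many entries a state has instead of assuming two. This is a minimal illustration only: the JSON record layout and the pack_state / unpack_state names are hypothetical and are not the paged_ssd_cache v3 format.

```python
import json
from typing import List, Sequence, Tuple


def pack_state(state: Sequence[Sequence[float]]) -> str:
    """Serialize a cache state of any arity (hypothetical record layout)."""
    return json.dumps({
        "n": len(state),                       # number of entries in the state
        "entries": [list(entry) for entry in state],
    })


def unpack_state(blob: str) -> Tuple[List[float], ...]:
    """Restore a state and reject records whose entry count does not match."""
    record = json.loads(blob)
    entries = record["entries"]
    if len(entries) != record["n"]:
        raise ValueError("cache record does not contain the declared number of entries")
    return tuple(entries)


# A classic 2-tuple (keys, values) and a 3-tuple in the shape of
# (buf_kv, buf_gate, pooled) both round-trip through the same code path.
kv_state = ([0.1, 0.2], [0.3, 0.4])
pooling_state = ([1.0], [2.0], [3.0])
restored_kv = unpack_state(pack_state(kv_state))
restored_pooling = unpack_state(pack_state(pooling_state))
```

A reader that assumed exactly two entries would drop the third buffer on reload, which is one way a cached session could come back subtly wrong.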
DFlash now supports Gemma 4
Gemma 4 runs on the DFlash engine (thanks to @bstnxbt's dflash-mlx), so the model lineup matches the rest of the pool. The admin quantization picker lights up every DFlash option including an FP16 draft-model boost (#880, thanks @deepsweet), with a configurable prefix cache size (#1120, thanks @yilmazorhan) and draft_window_size / draft_sink_size / verify_mode model settings (#1276).
Chunked prefill (#1224)
A long-context prompt no longer blocks decode for other in-flight requests: prefill advances one chunk per scheduler step, so concurrent requests keep streaming tokens through it. Off by default, toggleable from admin. Thanks @drumtorben.
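The scheduling idea described above can be sketched with a toy scheduler: each step advances at most one prefill chunk and also emits one decode token for every request that is already decoding. The Request class, scheduler_step function, and chunk size are hypothetical names for illustration, not the oMLX scheduler.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    name: str
    prompt_len: int          # prompt tokens that must be prefilled
    max_new_tokens: int      # tokens to generate after prefill
    prefilled: int = 0
    generated: int = 0

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len

    @property
    def done(self) -> bool:
        return not self.in_prefill and self.generated >= self.max_new_tokens


def scheduler_step(requests: List[Request], chunk_size: int) -> List[str]:
    """Run one step: one prefill chunk at most, plus one token per decoding request."""
    events: List[str] = []
    # Snapshot the decoding set first, so a request that finishes prefill
    # in this step starts decoding on the next one.
    decoding = [r for r in requests if not r.in_prefill and not r.done]

    for r in requests:
        if r.in_prefill:
            step = min(chunk_size, r.prompt_len - r.prefilled)
            r.prefilled += step
            events.append(f"{r.name}:prefill+{step}")
            break  # only one chunk per step

    for r in decoding:
        r.generated += 1
        events.append(f"{r.name}:decode")
    return events


long_req = Request("long", prompt_len=10, max_new_tokens=1)
short_req = Request("short", prompt_len=2, max_new_tokens=3, prefilled=2)
trace = [scheduler_step([long_req, short_req], chunk_size=4) for _ in range(4)]
```

In this trace the short request emits a token on each of the first three steps while the long prompt is still being prefilled in chunks of 4, 4, and 2 tokens.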
Major stability improvements on low-memory Macs
oMLX is far more resilient on tight-memory machines. A new memory enforcer measures the same phys_footprint metric the OS uses for jetsam decisions and applies prefill admission control, so the server declines work before it would be killed instead of crashing under pressure. Backed by a hot-cache eviction race fix (#1298), parallelized SSD↔hot block preloading (#1301), per-model cache hit-rate visibility (#1183, all thanks @ivaniguarans), and a real-time memory bar on the admin dashboard (#1278, thanks @beamivalice). oQ can also auto-build a proxy model when the source can't fit in RAM, so large checkpoints are quantizable on smaller boxes (#1136).
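Prefill admission control of this kind can be reduced to a simple budget check. The sketch below assumes the caller already has a footprint reading and a per-request estimate; the function name, the headroom parameter, and the numbers are hypothetical, and how oMLX actually reads phys_footprint or sizes its estimate is not shown here.

```python
GIB = 1024 ** 3


def admit_prefill(
    current_footprint_bytes: int,
    estimated_prefill_bytes: int,
    limit_bytes: int,
    headroom_fraction: float = 0.10,
) -> bool:
    """Return True if the request fits under the limit with headroom to spare.

    All inputs are supplied by the caller; this sketch does not measure memory.
    """
    if not 0.0 <= headroom_fraction < 1.0:
        raise ValueError("headroom_fraction must be in [0, 1)")
    budget = limit_bytes * (1.0 - headroom_fraction)
    return current_footprint_bytes + estimated_prefill_bytes <= budget


# With an 8 GiB limit and 10% headroom the budget is 7.2 GiB.
fits = admit_prefill(6 * GIB, 1 * GIB, 8 * GIB)       # 7 GiB total
too_big = admit_prefill(6 * GIB, 2 * GIB, 8 * GIB)    # 8 GiB total
```

Declining the second request up front is the behavior the notes describe: the server refuses work it cannot fit rather than being killed partway through it.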
ParoQuant support
Adds ParoQuant plus a pluggable custom-quantization loader so additional quant methods plug in without forking the loader path; all load call sites route through the dispatcher (#209, thanks @liang2kl).
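A pluggable loader with a single dispatcher is often built as a small registry. The sketch below shows that general pattern only; register_quant_loader, load_model, and the method keys are hypothetical names and do not reflect the actual oMLX loader API.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping a quantization method name to its loader.
_LOADERS: Dict[str, Callable[[str], Any]] = {}


def register_quant_loader(method: str) -> Callable[[Callable[[str], Any]], Callable[[str], Any]]:
    """Decorator that registers a loader function under a method name."""
    def decorator(fn: Callable[[str], Any]) -> Callable[[str], Any]:
        _LOADERS[method] = fn
        return fn
    return decorator


def load_model(path: str, quant_method: str = "default") -> Any:
    """Single entry point: every call site goes through this dispatcher."""
    try:
        loader = _LOADERS[quant_method]
    except KeyError:
        raise ValueError(f"no loader registered for {quant_method!r}") from None
    return loader(path)


@register_quant_loader("default")
def _load_default(path: str) -> dict:
    # Placeholder result standing in for a loaded model.
    return {"path": path, "quant": "default"}


@register_quant_loader("paroquant")
def _load_paroquant(path: str) -> dict:
    # Placeholder result standing in for a model loaded by a custom method.
    return {"path": path, "quant": "paroquant"}


model = load_model("some/model/dir", quant_method="paroquant")
```

With this shape, adding another quantization method means registering one more function; the call sites that use load_model do not change.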
New Features
One-command coding agents: omlx launch wires env + model and execs into the agent via a curses TUI picker (#998 @fparrav, #1085 @scaryrawr, #1250 @shannonsands).
Chat multi-tasking: run multiple admin chats in parallel (#1231, @beamivalice).
Admin "Restart Server" button, admin-auth gated (#1194, @jasonpaulso).
Native reasoning in the Responses API survives tool-call round-trips (#1245,...