Release 4.0.0 · huggingface/transformers.js · GitHub
//releases/show" data-turbo-transient="true" />
Skip to content
Search or jump to...
Search code, repositories, users, issues, pull requests...
-->
Search
Clear
Search syntax tips
Provide feedback
--><br>We read every piece of feedback, and take your input very seriously.
Include my email address so I can be contacted
Cancel
Submit feedback
Saved searches
Use saved searches to filter your results more quickly
-->
Name
Query
To see all available qualifiers, see our documentation.
Cancel
Create saved search
Sign in
//releases/show;ref_cta:Sign up;ref_loc:header logged out"}"<br>Sign up
Appearance settings
Resetting focus
You signed in with another tab or window. Reload to refresh your session.<br>You signed out in another tab or window. Reload to refresh your session.<br>You switched accounts on another tab or window. Reload to refresh your session.
Dismiss alert
{{ message }}
huggingface
transformers.js
Public
Notifications<br>You must be signed in to change notification settings
Fork<br>1.2k
Star<br>16.1k
4.0.0
Compare
Choose a tag to compare
Sorry, something went wrong.
Filter
Loading
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
No results found
View all tags
xenova
released this
30 Mar 12:55
·
26 commits
to main<br>since this release
4.0.0
364ebd4
This commit was created on GitHub.com and signed with GitHub’s verified signature .
GPG key ID: B5690EEEBB952194
Verified
Learn about vigilant mode.
🚀 Transformers.js v4
We're excited to announce that Transformers.js v4 is now available on NPM! After a year of development (we started in March 2025 🤯), we're finally ready for you to use it.
npm i @huggingface/transformers
Links: YouTube Video, Blog Post, Demo Collection
New WebGPU backend
The biggest change is undoubtedly the adoption of a new WebGPU Runtime, completely rewritten in C++. We've worked closely with the ONNX Runtime team to thoroughly test this runtime across our ~200 supported model architectures, as well as many new v4-exclusive architectures.
In addition to better operator support (for performance, accuracy, and coverage), this new WebGPU runtime allows the same transformers.js code to be used across a wide variety of JavaScript environments, including browsers, server-side runtimes, and desktop applications. That's right, you can now run WebGPU-accelerated models directly in Node, Bun, and Deno!
We've proven that it's possible to run state-of-the-art AI models 100% locally in the browser, and now we're focused on performance: making these models run as fast as possible, even in resource-constrained environments. This required completely rethinking our export strategy, especially for large language models. We achieve this by re-implementing new models operation by operation, leveraging specialized ONNX Runtime Contrib Operators like com.microsoft.GroupQueryAttention, com.microsoft.MatMulNBits, or com.microsoft.QMoE to maximize performance.
For example, adopting the com.microsoft.MultiHeadAttention operator, we were able to achieve a ~4x speedup for BERT-based embedding models.
ONNX Runtime improvements by @xenova in #1306
Transformers.js V4: Native WebGPU EP, repo restructuring, and more! by @xenova in #1382
New models
Thanks to our new export strategy and ONNX Runtime's expanding support for custom operators, we've been able to add many new models and architectures to Transformers.js v4. These include popular models like GPT-OSS, Chatterbox, GraniteMoeHybrid, LFM2-MoE, HunYuanDenseV1, Apertus, Olmo3, FalconH1, and Youtu-LLM. Many of these required us to implement support for advanced architectural patterns, including Mamba (state-space models), Multi-head Latent Attention (MLA), and Mixture of Experts (MoE). Perhaps most importantly, these models are all compatible with WebGPU, allowing users to run them directly in the browser or server-side JavaScript environments with hardware acceleration. We've released several Transformers.js v4 demos so far... and we'll continue to release more!
Additionally, we've added support for larger models exceeding 8B parameters. In our tests, we've been able to run GPT-OSS 20B (q4f16) at ~60 tokens per second on an M4 Pro Max.
Add support for Apertus by @nico-martin in #1465
Add support for FalconH1 by @xenova in #1502
Add support for Cohere's Tiny Aya models by @xenova in #1529
Add support for AFMoE by @xenova in #1542
Add support for new Qwen VL models (Qwen2.5-VL, Qwen3-VL, Qwen3.5, and Qwen3.5 MoE) by @xenova in #1551
Add support for Qwen2 MoE, Qwen3 MoE, Qwen3 Next, Qwen3-VL MoE, and Olmo Hybrid by @xenova in #1562
Add support for EuroBERT by @xenova in #1583
Add support for LightOnOCR and GLM-OCR by @xenova in #1582
Add support for Nemotron-H by @xenova in #1585
Add support for DeepSeek-v3 by @xenova in #1586
Add support for mistral4 by @xenova in #1587
Add support for GLM-MoE-DSA by...