Fable 5 pushed Gemma 4 to 255 tok/s on WebGPU

kirubakaran2 pts0 comments

Xenova (@xenovacom): "Before Fable 5 was shut down, it pushed Gemma 4 to 255 tok/s on WebGPU. Some didn't believe it was real.

Today we're releasing the demo and kernels it wrote for you to see yourself. Run it locally in your browser.

Agentic kernel optimization is the future of on-device inference" | XCancel

Xenova

@xenovacom

Jun 13

I gave Fable 5 one job: write custom WebGPU kernels for Gemma 4 inference.

It climbed to 84 tok/s, then hit a wall, insisting further optimization was impossible.

Hours later, Anthropic rolled back invisible LLM development safeguards, and it hit 255 tok/s.

The next day, access to Fable 5 was suspended globally.

Jun 17, 2026 · 4:54 PM UTC
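For scale, the two throughput figures quoted above (84 tok/s and 255 tok/s) can be converted to per-token latency and a speedup ratio. This is a minimal arithmetic sketch; the helper names `msPerToken` and `speedup` are illustrative and not part of the released demo.

```javascript
// Hypothetical helpers for illustration; not taken from the demo or its kernels.

// Convert a decode throughput (tokens per second) to milliseconds per token.
function msPerToken(tokensPerSecond) {
  if (!(tokensPerSecond > 0)) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  return 1000 / tokensPerSecond;
}

// Ratio between two throughputs measured on the same workload.
function speedup(beforeTokPerSec, afterTokPerSec) {
  return afterTokPerSec / beforeTokPerSec;
}

const before = 84;  // tok/s where the agent reportedly stalled
const after = 255;  // tok/s reported for the final kernels

console.log(msPerToken(before).toFixed(2));      // "11.90" ms per token
console.log(msPerToken(after).toFixed(2));       // "3.92" ms per token
console.log(speedup(before, after).toFixed(2));  // "3.04"
```

The ratio is only meaningful if both numbers were measured on the same hardware, model variant, and prompt length, which the posts do not specify.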

Xenova

@xenovacom

21h

In case you hadn't noticed, we're working on something big. Stay tuned.

🔗 Link to the demo: huggingface.co/spaces/webml-…

Gemma 4 WebGPU Kernels - a Hugging Face Space by webml-community

Loïck Chambon (PhD in CV) - 🇺🇦🇮🇷🇪🇺

@LoickCh

5h

Replying to @xenovacom

Do we know what they optimized?

Unni

@karmakomik

17h

Replying to @xenovacom

Will try but I hope it did not just optimise for your GPU 😅

The Singularity Project

@01Singularity01

20h

Replying to @xenovacom

Failed to load: No supported WebGPU variant for com.xenova.gemma4.DecodeOprojNorm; rejected fused_rows: when guard resolved to false; fused: when guard resolved to false

xcaliburr

@xscorpiox101

19h

Replying to @xenovacom

I'm much more interested in whether the output is still correct; quality normally deteriorates when speed increases
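The thread does not say how the released kernels were validated. One way to check the concern above is to compare the logits from a reference implementation with those from the optimized path for the same input, looking at the largest absolute difference and whether the top token matches. This is a sketch under that assumption; `compareLogits` and the sample values are hypothetical.

```javascript
// Hypothetical helper: compares one decode step's logits from a reference
// implementation against the logits produced by an optimized kernel path.
function compareLogits(reference, optimized) {
  if (reference.length !== optimized.length) {
    throw new Error("logit arrays must have the same length");
  }
  let maxAbsDiff = 0;
  let refArgmax = 0;
  let optArgmax = 0;
  for (let i = 0; i < reference.length; i++) {
    // Track the worst-case elementwise deviation.
    maxAbsDiff = Math.max(maxAbsDiff, Math.abs(reference[i] - optimized[i]));
    // Track the highest-scoring token on each side.
    if (reference[i] > reference[refArgmax]) refArgmax = i;
    if (optimized[i] > optimized[optArgmax]) optArgmax = i;
  }
  return { maxAbsDiff, sameTopToken: refArgmax === optArgmax };
}

// Made-up values purely to exercise the function.
const ref = new Float32Array([0.1, 2.5, -1.0, 0.7]);
const fast = new Float32Array([0.1, 2.4375, -1.0, 0.75]);
console.log(compareLogits(ref, fast));
```

Small numeric drift is expected when kernels change accumulation order or precision; a changed top token under greedy decoding is the more direct sign that output quality may differ.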

Fab 🇧🇷🇨🇦

@FlockonUS

19h

Replying to @xenovacom

How many GB will my browser load if I access the page?

octalmage

@octalmage

17h

Replying to @xenovacom

it doesn't know...

Ian Danforth

@iand_elicit

18h

Replying to @xenovacom

As far as I can tell it's fast and not very high quality. So interesting technical work, but I wouldn't use the model for anything.

The Singularity Project

@01Singularity01

18h

Replying to @xenovacom

WebGPU: Hardware accelerated

Adapter selected with `powerPreference: "high-performance"`:

```js
vendor: "nvidia",
architecture: "ampere",
subgroupMinSize: 32,
subgroupMaxSize: 128,
features: [
  "shader-f16",
  "subgroups",
  "timestamp-query",
  ...
```

GPU: NVIDIA GeForce RTX 2050

Likely cause

The embedded `DecodeOprojNorm` variant guard appears to require an exact fixed subgroup range:

```js
device.features.has("subgroups") &&
device.adapterInfo.subgroupMinSize == 32 &&
device.adapterInfo.subgroupMaxSize == 32
```

On this NVIDIA Ampere/D3D12 adapter, Chrome reports:

```js
subgroupMinSize: 32
subgroupMaxSize: 128
```

So both `fused_rows` and `fused` variants are rejected before compilation, even though the adapter supports `subgroups` and `shader-f16`.

Suggested fix

Please add a compatible fallback or relax/add a variant for adapters where subgroup size includes 32 but `subgroupMaxSize > 32`, e.g. NVIDIA/D3D12. If the WGSL is safe for 32-lane subgroup assumptions, the guard might be closer to:

```js
subgroupMinSize = 32
```

Otherwise, a separate NVIDIA/D3D12 variant or non-fixed-subgroup fallback would allow the demo to run on hardware-backed WebGPU adapters that expose a subgroup range rather than fixed 32.
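The guard-and-fallback idea in this reply can be sketched as a small variant selector. The variant names `fused_rows` and `fused` come from the error message earlier in the thread; everything else here (the `fused_range` and `generic` variants, the `when` field shape, and `selectVariant` itself) is an assumption for illustration and not the demo's actual code. Note that a reported range that includes 32 does not by itself guarantee a shader runs with 32 lanes, so the relaxed guard only suits kernels that do not hard-code the subgroup size.

```javascript
// Illustrative only: guards and fallback variants below are assumptions,
// not the released demo's implementation.

// Strict guard as described in the reply: subgroup size fixed at exactly 32.
const fixed32 = (info, features) =>
  features.has("subgroups") &&
  info.subgroupMinSize === 32 &&
  info.subgroupMaxSize === 32;

// Relaxed guard: 32 lies within the reported subgroup size range.
const includes32 = (info, features) =>
  features.has("subgroups") &&
  info.subgroupMinSize <= 32 &&
  info.subgroupMaxSize >= 32;

// Return the first variant whose guard passes, plus the names rejected on the way.
function selectVariant(variants, info, features) {
  const rejected = [];
  for (const variant of variants) {
    if (variant.when(info, features)) {
      return { name: variant.name, rejected };
    }
    rejected.push(variant.name);
  }
  return { name: null, rejected };
}

// Values reported in the reply for the NVIDIA/D3D12 adapter.
const features = new Set(["shader-f16", "subgroups", "timestamp-query"]);
const info = { subgroupMinSize: 32, subgroupMaxSize: 128 };

const strict = [
  { name: "fused_rows", when: fixed32 },
  { name: "fused", when: fixed32 },
];
const relaxed = [
  ...strict,
  { name: "fused_range", when: includes32 },  // hypothetical range-tolerant variant
  { name: "generic", when: () => true },      // hypothetical non-subgroup fallback
];

console.log(selectVariant(strict, info, features));        // no variant selected
console.log(selectVariant(relaxed, info, features).name);  // "fused_range"
```

In a browser, `info` and `features` would presumably come from the adapter returned by `navigator.gpu.requestAdapter({ powerPreference: "high-performance" })`, assuming the browser exposes the subgroup size fields shown in the reply.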

ansuman

@ansuman_bin

17h

Replying to @xenovacom

bro retired too early!

WuBu ⪋ WaefreBeorn 🇺🇸 👑

@waefrebeorn

19h

Replying to @xenovacom @crosstensor

thank you for releasing the work for peer review

i respect your efforts now

ZenithAi

@ZenithAiLab

5h

Replying to @xenovacom

255 tok/s in your browser. Fable 5 proved it, now you can run it. Agentic kernels = local AI unchained

Adria B.A.

@Adria_MBA

15h

Replying to @xenovacom

That one went to 500 tok/sec

Leandro von Werra

@lvwerra

Jun 16

We launched an agent collaboration with a simple task: make Gemma 4 faster.

Over 100 agents from all over the world joined, exchanged 1000+ messages and submitted 450 results.

A week of collaboration later the throughput went from 100 tok/s to over 500 tok/s.

NeoLabsFlow

@Sika12225983

16h

Replying to @xenovacom @ClementDelangue

Hmm, curious if it really speeds things up!

usul365

@yusufgider

3h

Replying to @xenovacom

Fable 5 pushed Gemma 4 to 255 tok/s on WebGPU, in the browser, locally. 🚀

The code has been open-sourced for those who didn't believe it. Try it yourself.

On-device inference is no longer theory; it's real. Is cloud dependence coming to an end? 👇

Vabbyshabby

@vabbyshabby

9h

Replying to @xenovacom

255 tok/s on webgpu with gemma 4 is the milestone that separates a...
