WebGPU feature detection was not enough to run small LLMs on phones — Ludion
Four test environments where the browser exposed WebGPU, and what the measurements say.
2026-06-18
I wanted to run a small language model in the browser, on a phone, without sending inference to a server. Feature detection is easy: you ask for a WebGPU adapter, read its limits, and if the buffer sizes are large enough you assume the model will run. Every browser environment I tested exposed WebGPU. As a first-pass check, the reported limits looked large enough for the model weights.
Then I ran the models. What a device reports about its GPU and what an inference run actually completes are two different things. Here are four cases from my own measurements.
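The first-pass check described above can be sketched as follows. This is a minimal illustration, not the code used for the measurements: `GpuLike`, `AdapterLike`, `detectWebGpu`, and `passesFirstPass` are hypothetical names, and the local interfaces stand in for the real WebGPU types so the sketch compiles on its own. `requestAdapter()`, `adapter.features.has("shader-f16")`, and `adapter.limits.maxBufferSize` are the standard WebGPU calls.

```typescript
// Minimal local types so the sketch compiles without @webgpu/types.
interface AdapterLike {
  features: { has(name: string): boolean };
  limits: { maxBufferSize: number };
}
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}

interface WebGpuReport {
  webgpu: boolean;
  f16: boolean;
  maxBufferSize: number;
}

// Hypothetical helper: records what the adapter claims, nothing more.
async function detectWebGpu(gpu: GpuLike | undefined): Promise<WebGpuReport> {
  const none: WebGpuReport = { webgpu: false, f16: false, maxBufferSize: 0 };
  if (!gpu) return none; // the browser does not expose WebGPU at all
  const adapter = await gpu.requestAdapter();
  if (!adapter) return none; // WebGPU is exposed but no adapter was granted
  return {
    webgpu: true,
    f16: adapter.features.has("shader-f16"),
    maxBufferSize: adapter.limits.maxBufferSize,
  };
}

// Hypothetical first-pass check: does the largest single buffer fit the limit?
// Passing this says nothing about whether a run will complete.
function passesFirstPass(report: WebGpuReport, largestBufferBytes: number): boolean {
  return report.webgpu && report.maxBufferSize >= largestBufferBytes;
}
```

In a browser, the argument would be `navigator.gpu`. The sketch only captures the reported values; it never exercises a download, an init, or a generation.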
All numbers below come from the raw measurement files in the repository. The models are Llama-3.2-1B-Instruct, Qwen2.5-1.5B-Instruct, and Qwen2.5-0.5B-Instruct, quantized to roughly 4-bit. The engines are WebLLM 0.2.84, transformers.js 4.2.0, and wllama 3.4.1. Each run started from a cold cache, with a short prompt of about 50 tokens and a long prompt of about 1200 tokens.
1. Safari on iPhone reloads the page during generation
The device is an iPhone 11 Pro Max on iOS 18.7, Safari 26.5. It reports webgpu: true, an Apple adapter with f16 support, and a maxBufferSize of 715827880 bytes. The reported maxBufferSize was large enough for the model weights, at least as a first-pass check.
None of the runs completed. Qwen2.5-1.5B through WebLLM downloaded all 728 MB and then failed at init with TypeError: Load failed. Llama-3.2-1B through WebLLM got further, reached generation on the WebGPU backend, and then the page reloaded mid-generation with no JavaScript-visible exception and no out-of-memory error I could catch. The smaller Qwen2.5-0.5B through wllama did the same thing at init: the tab reloaded before it ever became ready. Across every engine and model on this device, zero runs completed. The failure mode is not an error you handle. It is the tab restarting under you.
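Because a reload leaves nothing to catch, one way to at least observe it after the fact is a breadcrumb written before each phase and cleared on success. The sketch below is an illustration under assumptions, not part of the measurement harness: the names `StorageLike`, `RunPhase`, `PHASE_KEY`, `markPhase`, `markDone`, and `interruptedPhase` are hypothetical, and it assumes the storage passed in (for example `sessionStorage`) survives the reload, which the measurements here do not establish.

```typescript
// Subset of the Web Storage API, so any compatible store can be injected.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

type RunPhase = "download" | "init" | "generate";

const PHASE_KEY = "llm-run-phase"; // hypothetical storage key

// Call immediately before entering each phase.
function markPhase(store: StorageLike, phase: RunPhase): void {
  store.setItem(PHASE_KEY, phase);
}

// Call only after a run has fully completed.
function markDone(store: StorageLike): void {
  store.removeItem(PHASE_KEY);
}

// Call on page load: a leftover breadcrumb means the previous run
// ended without reaching markDone, e.g. because the tab was reloaded.
function interruptedPhase(store: StorageLike): RunPhase | null {
  const value = store.getItem(PHASE_KEY);
  return value === "download" || value === "init" || value === "generate"
    ? value
    : null;
}
```

This does not prevent the reload. It only turns a silent restart into a record of which phase was in progress when the page went away.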
2. LINE's in-app browser exposes WebGPU but the run never completes
The device is a Pixel 8a, 8 GB of memory, opened inside the LINE in-app browser on Android 16. It reports webgpu: true, an Arm Valhall adapter with f16, and a maxBufferSize of 4294967292 bytes, which is the full 4 GB ceiling. Nothing in the adapter limits distinguished it from the Chrome run that completed.
The Llama-3.2-1B session started, stalled mid-download, and never reached a single completed run. The results file for that session has an empty runs list. The adapter report told me nothing about whether the in-app browser would carry a download and an init to the end. It did not.
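A stalled download like this one can be told apart from a slow one by watching whether the byte count keeps increasing. The following is a minimal sketch with a hypothetical `StallDetector` class; the timeout value and the source of the progress samples are assumptions, since each engine reports download progress in its own way.

```typescript
// Hypothetical helper: flags a download as stalled when the byte count
// has not increased for at least `stallAfterMs` milliseconds.
class StallDetector {
  private lastBytes = 0;
  private lastChangeMs: number;
  private readonly stallAfterMs: number;

  constructor(stallAfterMs: number, startMs: number) {
    this.stallAfterMs = stallAfterMs;
    this.lastChangeMs = startMs;
  }

  // Record a progress sample; only an increase in bytes counts as progress.
  onProgress(bytesLoaded: number, nowMs: number): void {
    if (bytesLoaded > this.lastBytes) {
      this.lastBytes = bytesLoaded;
      this.lastChangeMs = nowMs;
    }
  }

  // True once no new bytes have arrived for the configured window.
  isStalled(nowMs: number): boolean {
    return nowMs - this.lastChangeMs >= this.stallAfterMs;
  }
}
```

In a page, `onProgress` would be fed from whatever progress callback the engine exposes, and `isStalled(Date.now())` would be polled on a timer so the session can record a stall instead of ending with an empty runs list.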
3. Same hardware and model, about twice the throughput from engine choice alone
On a Windows desktop with an AMD RDNA 4 GPU, Chrome 148, I ran the same Llama-3.2-1B with the short prompt through all three engines. WebGPU is present and used in every case. The decode rate is the median of three runs.
Llama-3.2-1B, short prompt, decode tokens per second (median of 3), AMD RDNA 4
engine                  decode tok/s
WebLLM 0.2.84           196.17
transformers.js 4.2.0   125.41
wllama 3.4.1            97.61
The fastest engine decodes about twice as fast as the slowest on identical hardware running the identical model. The WebGPU support flag reads the same for all three. The measured throughput does not.
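The "about twice" figure follows directly from the three medians in the table. A short sketch of the arithmetic, with hypothetical variable and function names:

```typescript
// Median decode rates from the table above (tokens per second).
const decodeRates: Array<[string, number]> = [
  ["WebLLM 0.2.84", 196.17],
  ["transformers.js 4.2.0", 125.41],
  ["wllama 3.4.1", 97.61],
];

// Slowest engine's rate, used as the baseline.
const slowestRate = Math.min(...decodeRates.map(([, rate]) => rate));

// Speedup of each engine relative to the slowest one, rounded to 2 decimals.
function speedupOverSlowest(rate: number): number {
  return Math.round((rate / slowestRate) * 100) / 100;
}

const speedups = decodeRates.map(
  ([engine, rate]): [string, number] => [engine, speedupOverSlowest(rate)],
);
// speedups[0] is ["WebLLM 0.2.84", 2.01]: 196.17 / 97.61 ≈ 2.01
```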
4. Pixel 8a completes, but a long prefill takes 76 seconds
The device is a Pixel 8a again, this time in plain Chrome 149, not an in-app browser. The Arm Valhall adapter reports the same 4 GB buffer ceiling. Here the model loads and runs to completion, so I have full timings.
With the short prompt of 52 input tokens, time to first token is about 3.8 seconds across three runs (3782, 3954, 3752 ms). With the long prompt of 1213 input tokens, time to first token is 77153, 76996, and 76449 ms. That is 76 to 77 seconds before the first token of the answer appears. Decode after that holds near 9 tokens per second. The same device that handles a one-line prompt in a few seconds takes well over a minute to read a page of context.
Across these four test environments, WebGPU exposure and large adapter limits were not enough to predict whether a small LLM run would complete. Feature detection answered whether WebGPU could be requested, not whether inference would finish.
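The timings above can be reduced to two numbers: how much longer the long prompt waits for its first token, and the effective input-token rate that wait implies. The sketch below uses hypothetical helper names and treats time to first token as if it were all prefill, which overstates prefill cost by whatever fixed overhead each run carries.

```typescript
// Time-to-first-token samples from the Pixel 8a Chrome runs, in milliseconds.
const shortTtftMs = [3782, 3954, 3752]; // 52 input tokens
const longTtftMs = [77153, 76996, 76449]; // 1213 input tokens

// Median of a non-empty list of numbers.
function medianOf(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Effective input tokens per second, assuming the whole wait is prefill.
function tokensPerSecond(tokens: number, elapsedMs: number): number {
  return tokens / (elapsedMs / 1000);
}

const shortMedianMs = medianOf(shortTtftMs); // 3782
const longMedianMs = medianOf(longTtftMs); // 76996
const ttftRatio = longMedianMs / shortMedianMs; // about 20.4
const longPrefillRate = tokensPerSecond(1213, longMedianMs); // about 15.8 tok/s
```

On these medians the long prompt has roughly 23 times the input tokens and waits roughly 20 times as long, so the wait grows close to linearly with prompt length on this device.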