Fast On-Device GenAI with LiteRT-LM

Blazing fast on-device GenAI with LiteRT-LM

- Google Developers Blog

MAY 19, 2026

Tenghui Zhu

Staff Software Engineer

Yu-hui Chen

Staff Software Engineer

Ram Iyengar

Senior Staff Software Engineer

When it comes to bringing advanced AI to the edge, Google AI Edge’s LiteRT-LM delivers one of the most powerful and optimized experiences for deploying Gemma 4 across platforms. Leveraging LiteRT (formerly TensorFlow Lite) for inference, LiteRT-LM powers local AI across many Google products, including Chrome, ChromeOS, the Pixel Watch, and the recent viral Google AI Edge Gallery app (Android / iOS). From unlocking state-of-the-art agentic capabilities with Gemma 4 to scaling our demanding production use cases, this proven engine is now ready to power your own applications. Read on for a deep dive into the underlying stack and how you can use LiteRT-LM for your own edge LLM deployments.

State-of-the-art performance

To fully unlock Gemma 4 on-device, we leverage the Google AI Edge stack, the most performant way to run Gemma 4 across platforms (for even greater performance, Gemma 4 can be run as a system service via Android AICore). To navigate the competing demands of restricted memory, limited compute, and fragmented hardware, this stack supports advanced quantization schemes alongside a foundation of accelerated XNNPACK and MLDrift kernels. By coupling this efficient footprint with the LiteRT runtime, the stack unlocks seamless model execution and broad portability across CPU, GPU, and NPU backends. Finally, at the orchestration layer, LiteRT-LM uses optimized pipelines to avoid costly CPU/GPU data transfers, alongside Multi-Token Prediction (MTP) and advanced session management. Together, this complete integration provides the highest-performing runtime environment for Gemma models.

LiteRT-LM prefill and decode performance running Gemma 4 E2B (Android: Samsung S26 Ultra, iOS: iPhone 17 Pro, Web: Chrome on a MacBook Pro 2024 with Apple M4 Max).

Built for speed across hardware backends and platforms

LiteRT-LM is engineered to deliver exceptional performance across the entire edge ecosystem, ensuring low-latency inference on Android, iOS, and the open web. To achieve this, the runtime provides hardware-specific backend optimizations through LiteRT, seamlessly accelerating workloads via CPU, GPU, and NPU (currently on Android). This approach allows developers to build once and achieve peak performance everywhere.

When running Gemma 4 E2B without MTP enabled, LiteRT-LM achieves an impressive 52 tokens/sec decode speed via the GPU backend on Android (OpenCL), and 56 tokens/sec on iOS (Metal). On the web, using WebGPU, developers can expect decode speeds of up to 76 tokens/sec on a MacBook Pro, proving that state-of-the-art on-device AI is now a reality regardless of the user's platform or hardware.

Multi-Token Prediction (MTP) for peak throughput

One of the most significant performance milestones in the LiteRT-LM pipeline is our native support for the Multi-Token Prediction (MTP) drafters recently launched with the Gemma 4 model family. By integrating this specialized speculative decoding architecture, LiteRT-LM bypasses traditional latency bottlenecks to deliver up to a 2.2x speedup.

Standard LLM inference is fundamentally memory-bandwidth bound; processors spend the majority of their time moving billions of parameters from VRAM to compute units just to generate a single token. While speculative decoding mitigates this, naive implementations can introduce new bottlenecks. LiteRT-LM prevents this by optimizing the data interplay between the primary Gemma 4 model and the MTP drafter.

To achieve this, LiteRT-LM enforces memory locality by executing both the lightweight MTP drafter and the primary model on the same hardware IP (e.g., the GPU). Managing the shared KV cache and activations within local memory entirely eliminates the latency penalties of cross-IP synchronization and data transfers. Once the drafter predicts future tokens, the primary model evaluates them using optimized kernels that maximize parallelization during verification. This streamlined architecture accelerates multi-token throughput without losing reasoning quality.
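To make the draft-then-verify flow concrete, here is a minimal Python sketch of greedy speculative decoding. It is not LiteRT-LM code and does not reflect its API: `speculative_decode`, `toy_target_next`, and `toy_draft_next` are hypothetical names, and the two "models" are toy functions over integer tokens. In a real runtime, the verification loop would be a single batched forward pass of the primary model.

```python
from typing import Callable, List, Tuple

NextTokenFn = Callable[[List[int]], int]


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[int],
    max_new_tokens: int,
    draft_len: int = 3,
) -> Tuple[List[int], int]:
    """Greedy draft-then-verify loop (illustrative only).

    Returns the generated sequence and the number of verification passes
    the target model needed.
    """
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The lightweight drafter proposes draft_len tokens one by one.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2. The target checks every drafted position. We loop here for
        #    clarity; conceptually this is one parallel verification pass.
        target_passes += 1
        accepted: List[int] = []
        for i in range(draft_len):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the target's own prediction
            # for the next position comes for free.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def toy_target_next(context: List[int]) -> int:
    # Hypothetical "primary model": counts upward modulo 10.
    return (context[-1] + 1) % 10


def toy_draft_next(context: List[int]) -> int:
    # Hypothetical drafter: agrees with the target except after a 4.
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


def greedy_decode(next_fn: NextTokenFn, prompt: List[int], n: int) -> List[int]:
    # Baseline: one target pass per generated token.
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_fn(tokens))
    return tokens
```

Because the target model's token is kept at the first mismatch, the output is identical to plain greedy decoding with the target alone. The drafter only changes how many target passes are needed, which is the sense in which throughput improves without a change in output quality.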

Enabling MTP in the LiteRT-LM pipeline requires only two lines of configuration, instantly unlocking up to 2.2x decoding speedup for low-latency applications. Numbers reported are collected on Samsung S26 Ultra using the GPU backend.
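As a rough way to read the throughput figures in this post, the short Python sketch below converts decode speed into per-token latency and applies the quoted "up to 2.2x" factor. It assumes the 2.2x figure applies to the 52 tokens/sec Android GPU baseline, which the post does not state explicitly, so the result is an upper bound rather than a measured number. The function names are illustrative.

```python
def per_token_latency_ms(tokens_per_sec: float) -> float:
    # Average time to emit one token at a given decode speed.
    return 1000.0 / tokens_per_sec


def mtp_upper_bound(tokens_per_sec: float, speedup: float = 2.2) -> float:
    # Best-case decode speed if the full quoted speedup is realized.
    # Assumption: the speedup multiplies the non-MTP baseline directly.
    return tokens_per_sec * speedup


# Decode speeds without MTP, as quoted in this post.
BASELINE_TOKENS_PER_SEC = {
    "Android GPU (OpenCL)": 52.0,
    "iOS GPU (Metal)": 56.0,
    "Web (WebGPU)": 76.0,
}

if __name__ == "__main__":
    for name, tps in BASELINE_TOKENS_PER_SEC.items():
        print(f"{name}: {per_token_latency_ms(tps):.1f} ms/token")
    android = BASELINE_TOKENS_PER_SEC["Android GPU (OpenCL)"]
    print(f"Android with MTP, upper bound: {mtp_upper_bound(android):.1f} tokens/sec")
```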

Session management for speed and continuity

Advanced session management in LiteRT-LM fundamentally transforms how mobile applications handle long-context interactions. By supporting native session save and restore capabilities, the engine allows large KV cache states, representing longer context histories, to be serialized and safely preserved across sessions....
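The idea of saving and restoring a session can be illustrated with a toy Python sketch. This is not the LiteRT-LM session API: `ToySession` and its methods are hypothetical, and the "KV cache" is a plain list of small vectors rather than real attention state. The point is only that a restored session can continue from the cached state without reprocessing the earlier context.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySession:
    """Hypothetical stand-in for an LLM session with a KV cache."""

    tokens: List[int] = field(default_factory=list)
    kv_cache: List[List[float]] = field(default_factory=list)
    model_steps: int = 0  # how many tokens this instance actually processed

    def append(self, token: int) -> None:
        # Stand-in for running one token through the model and caching
        # its key/value entry.
        self.tokens.append(token)
        self.kv_cache.append([float(token), float(token) * 0.5])
        self.model_steps += 1

    def save(self) -> bytes:
        # Serialize the context and cache so they survive an app restart.
        state = {"tokens": self.tokens, "kv_cache": self.kv_cache}
        return json.dumps(state).encode("utf-8")

    @classmethod
    def restore(cls, blob: bytes) -> "ToySession":
        # Rebuild the session from saved state; no tokens are reprocessed.
        state = json.loads(blob.decode("utf-8"))
        return cls(tokens=state["tokens"], kv_cache=state["kv_cache"])


if __name__ == "__main__":
    session = ToySession()
    for tok in [3, 1, 4]:
        session.append(tok)
    blob = session.save()

    resumed = ToySession.restore(blob)
    resumed.append(1)  # only the new token is processed
    print(resumed.tokens, resumed.model_steps)
```

In this toy version, the restored session reports a single model step for the new token, whereas rebuilding the context from scratch would process every earlier token again.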
