Accelerating Gemini Nano Models on Pixel with Frozen Multi-Token Prediction

June 26, 2026 Eden Cohen, Research Product Manager, and Michelle Ramanovich, Research Manager, Google Platforms and Devices

We introduce a method to retrofit Multi-Token Prediction onto frozen production models, accelerating on-device inference without the inefficiencies of separate drafters.

Having powerful Large Language Models (LLMs) right in your pocket is now a reality with on-device models like Gemini Nano and Gemma. This technology enables everyday features on your phone, such as instantly summarizing a flurry of notifications or proofreading an important text message, all without sending your private data off device. But to be useful for everyday users, these features need to run quickly and efficiently.

Delivering this kind of speed on a mobile device is a significant challenge. Unlike vast server environments, mobile phones operate under a strict energy budget and hard memory (RAM) limits. Furthermore, standard language models generate text "autoregressively", meaning they process and output just one word (or token) at a time. This step-by-step process creates a bottleneck, underutilizing the phone's processing power while straining its memory bandwidth, which can ultimately slow down the user experience and drain the battery.

To overcome this bottleneck, we are announcing a new architecture that retrofits Multi-Token Prediction (MTP) onto existing, "frozen" Gemini Nano v3 models. Building on prior approaches like the EAGLE framework and Confident Adaptive Language Modeling (CALM), we designed new architectural components to maximize these efficiency gains specifically for mobile environments. Our recent announcements highlighted accelerating Gemma 4 with MTP and making it available to developers. Today's article tackles the unique, extreme constraints of edge computing.

Recently rolled out to the Pixel 9 and 10 series, this approach acts as an out-of-the-box speedup. For users, this means that features like AI Notification Summaries and Proofread generate text significantly faster and with less energy consumption. For developers, it removes a major friction point: high-speed on-device AI no longer requires fine-tuning a separate, memory-heavy drafting model for every new task.

A "late exit" strategy

MTP builds upon the evolution of speculative decoding. In a traditional setup, generating N tokens requires N forward passes of the large model. Speculative decoding decouples this process into two parts:

Draft: a smaller, faster approximation model (the "drafter") generates a short sequence of candidate tokens (e.g., 3 tokens).

Verify: a large model (the "verifier") processes these candidates in parallel. If the candidates match what the large model would have predicted, they are accepted. If not, the system rolls back to the first divergence.

However, this setup has inefficiencies of its own. A separate "standalone" drafter model (e.g., 128M parameters) competes for limited RAM. Furthermore, a standalone drafter is "blind" to the main model's rich internal state: it predicts next tokens based solely on text history, without the semantic context the main model has already computed.

MTP addresses these inefficiencies by moving from a standalone architecture to an integrated one. Instead of training a separate small language model to draft tokens, we append a lightweight Transformer head, the MTP head, to the final layers of the main model. This architecture, which uses a deep exit layer for drafting, leverages the work already performed by the main model's backbone. The MTP head takes the final high-dimensional activations (hidden states) of the main model and uses them to autoregressively predict a sequence of future tokens.
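The draft-and-verify loop described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production implementation: `toy_target` and `toy_drafter` are hypothetical stand-ins for the verifier and the drafter, decoding is greedy, and the verifier's parallel pass is simulated with a loop that is counted as a single pass.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(target: NextTokenFn, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(
    target: NextTokenFn,
    drafter: NextTokenFn,
    prompt: List[Token],
    n_new: int,
    k: int = 3,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify. Returns (new tokens, number of verify passes)."""
    seq = list(prompt)
    verify_passes = 0
    while len(seq) - len(prompt) < n_new:
        # Draft: the cheap model proposes k candidate tokens autoregressively.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(seq + draft))
        # Verify: a real verifier scores all k + 1 positions in one batched
        # forward pass. Here that pass is simulated with a loop (assumption).
        verify_passes += 1
        expected = [target(seq + draft[:i]) for i in range(k + 1)]
        # Accept the longest prefix on which the draft matches the target.
        n_ok = 0
        while n_ok < k and draft[n_ok] == expected[n_ok]:
            n_ok += 1
        # Keep the accepted tokens plus the target's own token at the first
        # divergence (or one extra token if every draft was accepted).
        seq.extend(expected[: n_ok + 1])
    return seq[len(prompt): len(prompt) + n_new], verify_passes


def toy_target(seq: List[Token]) -> Token:
    """Hypothetical verifier: a deterministic next-token rule."""
    return (3 * seq[-1] + 1) % 17


def toy_drafter(seq: List[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except at every third position."""
    guess = (3 * seq[-1] + 1) % 17
    return (guess + 1) % 17 if len(seq) % 3 == 0 else guess
```

Calling `speculative_decode(toy_target, toy_drafter, [1], 12)` returns the same 12 tokens as `greedy_decode(toy_target, [1], 12)`, but uses 4 verification passes instead of 12 target calls. The saving depends entirely on how often the drafter agrees with the verifier; a drafter that is always wrong still produces the correct output, just with no speedup.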

The frozen backbone advantage

While MTP heads are commonly pre-trained in tandem with the backbone, as in our recent releases of Gemma 4 models, this is prohibitive when leveraging already-deployed on-device foundation models. Instead, our work focuses on retrofitting the drafter head to operate independently of the pre-training pipeline. We take a fully trained Gemini Nano v3 model, freeze its weights, and attach a dense Transformer stack (the MTP head) to the final layers. We train only the head's parameters to minimize the prediction error on future tokens.

With a frozen backbone, MTP becomes strictly an efficiency optimization, ensuring no degradation in the base model's capabilities or safety alignment. Because incorrect drafts are discarded during verification, the final output remains bit-for-bit identical to the main model's output, allowing us to roll out efficiency updates with full backward compatibility.
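The idea of training only the head on top of a frozen backbone can be illustrated with a deliberately tiny, pure-Python sketch. Everything here is an assumption made for illustration: the "backbone" is a fixed table of hidden states per token rather than a real model, the "head" is a single linear layer rather than a Transformer stack, and the training target is the token two positions ahead. The names `make_backbone` and `train_head` are hypothetical.

```python
import math
import random
from typing import List, Tuple

VOCAB, DIM = 5, 4  # toy sizes, chosen only for illustration


def make_backbone(seed: int = 0) -> List[List[float]]:
    """Stand-in for the frozen model: one fixed hidden-state vector per token."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_logits(head: List[List[float]], hidden: List[float]) -> List[float]:
    return [sum(w * h for w, h in zip(row, hidden)) for row in head]


def train_head(
    backbone: List[List[float]],
    tokens: List[int],
    steps: int = 200,
    lr: float = 0.5,
) -> Tuple[List[List[float]], List[float]]:
    """Train only the head to predict the token two positions ahead."""
    head = [[0.0] * DIM for _ in range(VOCAB)]  # the only trainable parameters
    pairs = [(tokens[i], tokens[i + 2]) for i in range(len(tokens) - 2)]
    losses: List[float] = []
    for _ in range(steps):
        total = 0.0
        grad = [[0.0] * DIM for _ in range(VOCAB)]
        for tok, future in pairs:
            hidden = backbone[tok]  # read-only: the backbone is never updated
            probs = softmax(head_logits(head, hidden))
            total += -math.log(probs[future])
            # Gradient of cross-entropy with respect to the head weights only.
            for v in range(VOCAB):
                err = probs[v] - (1.0 if v == future else 0.0)
                for d in range(DIM):
                    grad[v][d] += err * hidden[d]
        for v in range(VOCAB):
            for d in range(DIM):
                head[v][d] -= lr * grad[v][d] / len(pairs)
        losses.append(total / len(pairs))
    return head, losses
```

Running `train_head(make_backbone(), [0, 1, 2, 3, 4] * 4)` lowers the head's loss while leaving every backbone value untouched, which is the property the frozen approach relies on: the base model's behavior cannot drift, because nothing in it is ever written to.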

Zero-copy architecture

While standard MTP implementations optimize for training efficiency by sharing...
