Google's DiffusionGemma uses diffusion tech to speed text generation




Google's new open-weights model brings image-generation tricks to AI text generation

Language model builds on diffusion tech to boost output performance by up to 4x, claims Chocolate Factory

Tobias Mann


Systems editor

Published Thu 11 Jun 2026 // 19:31 UTC

The boffins on Google’s DeepMind team unveiled an experimental new language model this week that uses techniques originally developed for AI image generators to boost text output performance by as much as 4x when running on resource-constrained consumer hardware. It's free to download and you can run it with just 18 GB of DRAM or VRAM.

The model, codenamed DiffusionGemma, is the latest addition to Google’s open weights model family. But unlike Gemma 4, which launched this spring, the 26 billion-parameter mixture of experts (MoE) model isn’t a large language model in a conventional sense.

Instead, it’s actually closer to image models like Stable Diffusion or Flux. Rather than generating tokens one after another in an autoregressive fashion, DiffusionGemma generates entire paragraphs' worth of tokens at the same time.


The process looks a lot like how a diffusion model turns what’s essentially static into an image through a series of denoising steps.
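For readers who think in code, here is a toy sketch of that idea applied to text. It is not Google's algorithm: `toy_denoiser`, `TARGET`, and the confidence scores are all made up for illustration. The only point it makes is that the number of model calls tracks the number of refinement steps rather than the number of tokens.

```python
import random

# Toy vocabulary and a fixed "answer" the fake model converges to.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
TARGET = ["the", "cat", "sat", "on", "a", "mat"]


def toy_denoiser(canvas, rng):
    """Hypothetical stand-in for a trained model.

    Returns one (token, confidence) guess per canvas position. A real model
    would condition on the canvas; this fake one ignores it and knows TARGET.
    """
    return [(TARGET[i], rng.random()) for i in range(len(canvas))]


def generate(steps=3, seed=0):
    rng = random.Random(seed)
    length = len(TARGET)
    # Start from a canvas of random tokens (the "static").
    canvas = [rng.choice(VOCAB) for _ in range(length)]
    settled = [False] * length
    per_step = -(-length // steps)  # ceiling division
    model_calls = 0
    for _ in range(steps):
        # One model call proposes a token for every position at once.
        guesses = toy_denoiser(canvas, rng)
        model_calls += 1
        # Commit the most confident guesses among positions still open.
        open_positions = [i for i in range(length) if not settled[i]]
        open_positions.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in open_positions[:per_step]:
            canvas[i] = guesses[i][0]
            settled[i] = True
    return canvas, model_calls


tokens, calls = generate(steps=3)
print(" ".join(tokens), "| model calls:", calls)
```

An autoregressive loop would need six model calls for these six tokens. The toy version needs three, and the gap widens as the block of tokens grows.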


As Google explains it, DiffusionGemma works by laying out a canvas of random tokens, and then refining them until the final output is reached.

Compared to conventional LLMs, which are memory-bandwidth bound and require a lot of VRAM, diffusion models are a predominantly compute-bound workload, which is why the Chocolate Factory is positioning these models for local deployment.

LLMs are autoregressive. During token generation, the model’s active parameters need to be streamed from memory for every token generated, making memory bandwidth a major bottleneck.

In the cloud, inference providers balance compute and memory bandwidth by processing hundreds or thousands of requests in parallel. As you might have guessed, this isn’t something the average user running a local model on their notebook can do.

However, many consumer products, like high-end graphics cards, have plenty of excess horsepower, which DiffusionGemma can take advantage of to boost output performance.

Diffusion language models aren’t perfect. Google isn’t the first to explore this tech. Previous models, like DREAM or Mercury 2, demonstrated major speedups over conventional LLMs, but generally underperformed them in benchmarks for their size.

DiffusionGemma doesn’t appear to be any different. According to Google, the 26 billion-parameter model falls just behind Gemma 4 12B in the GPQA-Diamond benchmark, with its main advantage being output speed, and even then it’s not as impressive as Google has made it out to be.
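The memory-bandwidth argument above can be put into back-of-the-envelope form. The sketch below is a simplification under stated assumptions, not a benchmark: every number in the usage example is hypothetical, compute limits are ignored entirely, and it assumes each pass over a block reads the same active parameters once (which understates traffic for an MoE model, where different tokens can route to different experts).

```python
def autoregressive_ceiling(active_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on tokens/s when every token requires streaming the
    active parameters from memory once."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


def parallel_ceiling(active_params, bytes_per_param, bandwidth_bytes_per_s,
                     tokens_per_block, passes_per_block):
    """Same bound when one weight pass updates a whole block of tokens and a
    block needs a fixed number of refinement passes (assumed, not measured)."""
    passes_per_s = bandwidth_bytes_per_s / (active_params * bytes_per_param)
    return passes_per_s * tokens_per_block / passes_per_block


# Hypothetical figures, chosen only to make the arithmetic easy to follow:
# 4 billion active parameters, 1 byte each, 1 TB/s of memory bandwidth.
ar = autoregressive_ceiling(4e9, 1.0, 1e12)
# Hypothetical block of 256 tokens refined over 32 passes.
par = parallel_ceiling(4e9, 1.0, 1e12, tokens_per_block=256, passes_per_block=32)
print(f"autoregressive ceiling: {ar:.0f} tok/s, block-parallel ceiling: {par:.0f} tok/s")
```

With these made-up inputs, the bandwidth ceiling rises by the ratio of tokens per block to passes per block. In practice the parallel case would hit the compute limit long before that ceiling, which is the sense in which the workload shifts from memory-bound to compute-bound.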

Here's how DiffusionGemma compares to the rest of the Gemma 4 lineup in terms of output quality versus output speed. Credit: Google


The chart shows a roughly 2.25x speedup for DiffusionGemma over the 12B-parameter LLM with speculative decode enabled. Compared to Gemma 4 26B-A4B, the speedup is nearly 4x when running on a single Nvidia H100.


DiffusionGemma is being released as an experimental model rather than an enterprise-focused one, like we saw with Gemma 4.

The model is available for download on popular model repos like Hugging Face under a highly permissive Apache 2.0 license. Support has already been merged into popular inference engines like vLLM, MLX, and HF Transformers, with support for Llama.cpp coming soon.

While local inference has largely been the domain of AI enthusiasts, companies like Google are increasingly leaning on the tech to cut cloud costs associated with their AI services. As you may recall, back in May, Google quietly began shipping a small LLM with its Chrome web browser. ®
