Google's DiffusionGemma uses diffusion tech to speed text generation
Google's new open-weights model brings image-generation tricks to AI text generation
Language model builds on diffusion tech to boost output performance by up to 4x, claims Chocolate Factory
Tobias Mann
Systems editor
Published Thu 11 Jun 2026 // 19:31 UTC
The boffins on Google’s DeepMind team unveiled an experimental new language model this week that uses techniques originally developed for AI image generators to boost text output performance by as much as 4x when running on resource-constrained consumer hardware. It's free to download and you can run it with just 18 GB of DRAM or VRAM.

The model, codenamed DiffusionGemma, is the latest addition to Google’s open-weights model family. But unlike Gemma 4, which launched this spring, the 26-billion-parameter mixture-of-experts (MoE) model isn’t a large language model in the conventional sense.

Instead, it’s actually closer to image models like Stable Diffusion or Flux. Rather than generating tokens one after another in an autoregressive fashion, DiffusionGemma generates entire paragraphs' worth of tokens at the same time.
The process looks a lot like how a diffusion model turns what’s essentially static into an image through a series of denoising steps.
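To make the contrast concrete, here is a toy sketch of the two decoding loops. It is not Google's algorithm: `toy_denoiser` is a hypothetical stand-in for a model forward pass with the answer hard-coded, and the rule of locking in the most confident guesses at each step is an assumption made purely for illustration. The only thing worth watching is how many passes each loop needs.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
TARGET = ["the", "cat", "sat", "on", "a", "mat"]  # what the toy "model" converges to


def toy_denoiser(canvas, rng):
    """Hypothetical stand-in for one forward pass over the whole canvas.

    Returns a (token, confidence) guess for every position at once.
    A real model would compute these from learned weights.
    """
    return [(TARGET[i], rng.random()) for i in range(len(canvas))]


def diffusion_generate(steps, seed=0):
    """Start from random tokens and refine every position in parallel."""
    rng = random.Random(seed)
    length = len(TARGET)
    canvas = [rng.choice(VOCAB) for _ in range(length)]  # the "static"
    fixed = [False] * length
    per_step = -(-length // steps)  # ceiling division
    passes = 0
    for _ in range(steps):
        guesses = toy_denoiser(canvas, rng)
        passes += 1
        # Assumed rule: commit the most confident positions still open.
        open_positions = [i for i in range(length) if not fixed[i]]
        open_positions.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in open_positions[:per_step]:
            canvas[i] = guesses[i][0]
            fixed[i] = True
    return canvas, passes


def autoregressive_generate():
    """Conventional decoding: one forward pass yields one token."""
    out = []
    passes = 0
    for token in TARGET:
        out.append(token)
        passes += 1
    return out, passes


if __name__ == "__main__":
    print(diffusion_generate(steps=3))
    print(autoregressive_generate())
```

In this toy, six tokens take three passes in the diffusion loop and six in the autoregressive one. Those numbers are invented for the example and say nothing about DiffusionGemma's real step counts.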
As Google explains it, DiffusionGemma works by laying out a canvas of random tokens, and then refining them until the final output is reached.

Compared to conventional LLMs, which are memory-bandwidth bound and require a lot of VRAM, diffusion models are a predominantly compute-bound workload, which is why the Chocolate Factory is positioning these models for local deployment.

LLMs are autoregressive. During token generation, the model’s active parameters need to be streamed from memory for every token generated, making memory bandwidth a major bottleneck.

In the cloud, inference providers balance compute and memory bandwidth by processing hundreds or thousands of requests in parallel. As you might have guessed, this isn’t something the average user running a local model on their notebook can do.

However, many consumer products, like high-end graphics cards, have plenty of excess horsepower, which DiffusionGemma can take advantage of to boost output performance.

Diffusion language models aren’t perfect. Google isn’t the first to explore this tech. Previous models, like DREAM or Mercury 2, demonstrated major speedups over conventional LLMs, but generally underperformed them in benchmarks for their size.

DiffusionGemma doesn’t appear to be any different. According to Google, the 26 billion-parameter model falls just behind Gemma 4 12B in the GPQA-Diamond benchmark, with its main advantage being output speed, and even then it’s not as impressive as Google has made it out to be.
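The memory-bandwidth argument above lends itself to some back-of-the-envelope arithmetic. The sketch below assumes every active parameter is read from memory once per generated token and ignores everything else (KV cache, activations, compute time). All three inputs are hypothetical round numbers picked to keep the sums easy, not measurements of any model or GPU.

```python
def decode_ceiling_tokens_per_s(active_params, bytes_per_param, bandwidth_bytes_per_s):
    """Rough upper bound on single-stream autoregressive decode speed.

    Assumes the active weights are streamed from memory once per token,
    so the ceiling is simply bandwidth divided by bytes moved per token.
    """
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
ACTIVE_PARAMS = 4e9    # assumed active parameters per token
BYTES_PER_PARAM = 2    # assumed 16-bit weights
BANDWIDTH = 500e9      # assumed 500 GB/s of memory bandwidth

ceiling = decode_ceiling_tokens_per_s(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
print(f"{ceiling:.1f} tokens/s at most")
```

Under those assumptions a single user cannot get past that ceiling however much spare compute the chip has, which is the gap that a model producing many tokens per pass is meant to exploit.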
Here's how DiffusionGemma compares to the rest of the Gemma 4 lineup in terms of output quality versus output speed. (Chart: Google)
The chart shows a roughly 2.25x speedup for DiffusionGemma over the 12B-parameter LLM with speculative decode enabled. Compared to Gemma 4 26B-A4B, the speedup is nearly 4x when running on a single Nvidia H100.
DiffusionGemma is being released as an experimental model rather than an enterprise-focused one, like we saw with Gemma 4.

The model is available for download on popular model repos like Hugging Face under a highly permissive Apache 2.0 license. Support has already been merged into popular inference engines like vLLM, MLX, and HF Transformers, with Llama.cpp support coming soon.

While local inference has largely been the domain of AI enthusiasts, companies like Google are increasingly leaning on the tech to cut cloud costs associated with their AI services. As you may recall, back in May, Google quietly began shipping a small LLM with its Chrome web browser. ®