Interfaze
The First Open Source Diffusion Audio ASR Model
We taught a diffusion model to listen in six languages.
We trained diffusion-gemma-asr-small, an audio-native ASR model that transcribes through DiffusionGemma's own diffusion decoder. It runs at 3-5× real-time (the low end on shorter clips, the high end on longer ones) and trains ~42M parameters on top of a frozen 26B backbone. The hard part had nothing to do with speed.
DiffusionGemma for ASR is the first multilingual diffusion ASR model, covering six languages from one adapter. It's the first built on a decoder that denoises by uniform, random-token diffusion, which initializes the canvas with random vocabulary tokens and anneals them into text, rather than the absorbing scheme the modern diffusion-LLM crowd uses. And it's the first to do speech recognition by training nothing but a ~42M-parameter adapter on a frozen, off-the-shelf diffusion LLM: no decoder trained from scratch, 0.16% of the weights touched. Where it overlaps the closest prior system it already wins: 6.6% vs Whisfusion's 8.3% on LibriSpeech, with a smaller encoder.
What DiffusionGemma actually does
DiffusionGemma is Google's 26B mixture-of-experts model (4B active, 128 experts, top-8) that generates text by discrete diffusion instead of autoregression. The detail that matters: it's uniform diffusion, not the absorbing kind most people picture.
It starts with a fixed-length canvas: 256 token slots filled with random tokens from the vocabulary. At each denoising step, the model looks at the whole canvas at once, keeps the predictions it's confident about, and re-randomizes the rest. After a few steps the noise anneals into text. Training mirrors this: corrupt a fraction γ of a clean sequence to random vocab IDs, then ask the model to recover the originals.
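To make the loop concrete, here is a minimal toy sketch of uniform diffusion in plain Python. It is not DiffusionGemma's code. `toy_model` is a hypothetical stand-in that peeks at the target to fake a denoiser, and the vocabulary size, canvas length, and confidence threshold are all made-up values chosen to keep the demo small.

```python
import random

VOCAB_SIZE = 50   # toy vocabulary (assumption); the real vocabulary is far larger
CANVAS_LEN = 8    # the post uses 256 slots; 8 keeps the demo readable


def corrupt(clean, gamma, rng):
    """Training side: overwrite a fraction gamma of positions with random vocab IDs."""
    n_corrupt = round(gamma * len(clean))
    positions = set(rng.sample(range(len(clean)), n_corrupt))
    return [rng.randrange(VOCAB_SIZE) if i in positions else tok
            for i, tok in enumerate(clean)]


def toy_model(canvas, target, rng):
    """Hypothetical denoiser: returns a (predicted token, confidence) pair per slot.

    Slots that already hold the right token are kept with full confidence.
    Other slots are guessed correctly 60% of the time, with high confidence
    when right and low confidence when wrong. A real model would derive both
    values from its logits instead of looking at the target.
    """
    preds = []
    for tok, tgt in zip(canvas, target):
        if tok == tgt:
            preds.append((tgt, 1.0))
        elif rng.random() < 0.6:
            preds.append((tgt, rng.uniform(0.7, 1.0)))
        else:
            preds.append((rng.randrange(VOCAB_SIZE), rng.uniform(0.0, 0.6)))
    return preds


def denoise(target, steps, threshold, rng):
    """Inference side: start from pure noise, keep confident slots, re-randomize the rest."""
    canvas = [rng.randrange(VOCAB_SIZE) for _ in target]
    for _ in range(steps):
        preds = toy_model(canvas, target, rng)
        canvas = [tok if conf >= threshold else rng.randrange(VOCAB_SIZE)
                  for tok, conf in preds]
    return canvas


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [rng.randrange(VOCAB_SIZE) for _ in range(CANVAS_LEN)]
    print("clean:    ", clean)
    print("corrupted:", corrupt(clean, gamma=0.5, rng=rng))
    print("denoised: ", denoise(clean, steps=50, threshold=0.65, rng=rng))
```

Note that there is no mask token anywhere: a noisy slot is just another vocabulary ID, which is the difference from the absorbing scheme.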
Architecturally it's an encoder-decoder with tied weights. The encoder reads the prompt causally into a KV cache; the decoder denoises the canvas bidirectionally, cross-attending to that cache. Out of the box it takes text, images, and video. No audio. That is what we address with our custom architecture.
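One way to picture the causal-prompt, bidirectional-canvas split is as a single attention mask over the concatenated sequence `[prompt; canvas]`. The sketch below is only an illustrative view under that assumption; `build_attention_mask` is a hypothetical helper, and the real model implements this with an encoder KV cache and cross-attention rather than one joint mask.

```python
def build_attention_mask(prompt_len, canvas_len):
    """Return mask[i][j] == True when position i may attend to position j.

    Positions 0..prompt_len-1 are prompt tokens, the rest are canvas slots.
    """
    total = prompt_len + canvas_len
    mask = []
    for i in range(total):
        if i < prompt_len:
            # Prompt tokens are read causally and never see the canvas.
            row = [j <= i for j in range(total)]
        else:
            # Canvas slots see the whole prompt and every canvas slot.
            row = [True] * total
        mask.append(row)
    return mask


if __name__ == "__main__":
    for row in build_attention_mask(prompt_len=3, canvas_len=2):
        print("".join("x" if allowed else "." for allowed in row))
```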
How do you make a text model hear?
Our first attempt skipped the audio encoder entirely. Gemma's own unified models project raw waveforms straight into the embedding space, so why not feed DiffusionGemma 40ms audio frames and let 26B parameters figure out the acoustics?
It failed completely. A frozen LLM has never seen a spectrogram, the embedding space has no notion of formants or phonemes, and gradient signal through a frozen backbone isn't enough to build an acoustic frontend from scratch. The model learned to ignore the audio and hallucinate fluent, confident nonsense.
So we introduced a frozen whisper-small encoder, strictly as a feature extractor, not a decoder. Whisper turns 30 seconds of audio into 1500 frames of 768-dim acoustic features. A small trainable projector (a couple of conv layers that subsample 8× plus a linear map to 2816 dims) compresses those into 188 "audio tokens," which we scatter into placeholder slots in the prompt right where the tokenizer already reserves IDs. Add LoRA adapters on the encoder/decoder attention so the backbone can learn to attend to this new modality, and you have a model that, on paper, should transcribe.
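The shape arithmetic is worth checking by hand. Whisper's 1500 frames over 30 seconds are 20 ms each, so an 8× subsample gives audio tokens that each cover about 160 ms. The sketch below shows one conv configuration that lands on exactly 188 tokens, plus the scatter step. The kernel, stride, and padding values are assumptions, since the post only says "a couple of conv layers that subsample 8×", and `scatter_audio` and its `placeholder_id` argument are hypothetical names.

```python
WHISPER_FRAMES = 1500   # frames for 30 s of audio (from the post)
WHISPER_DIM = 768       # whisper-small feature width (from the post)
BACKBONE_DIM = 2816     # projector output width (from the post)

# Assumed (kernel, stride, padding) per conv layer: total stride 2 * 4 = 8.
CONV_LAYERS = [(3, 2, 1), (3, 4, 1)]


def conv1d_out_len(length, kernel, stride, padding):
    """Standard output-length formula for a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


def audio_token_count(frames, layers):
    """Push a sequence length through each conv layer in turn."""
    n = frames
    for kernel, stride, padding in layers:
        n = conv1d_out_len(n, kernel, stride, padding)
    return n


def scatter_audio(prompt_ids, prompt_embeds, audio_embeds, placeholder_id):
    """Overwrite the embeddings at placeholder positions with audio embeddings, in order."""
    slots = [i for i, tok in enumerate(prompt_ids) if tok == placeholder_id]
    if len(slots) != len(audio_embeds):
        raise ValueError("placeholder count must match the number of audio tokens")
    out = list(prompt_embeds)
    for slot, emb in zip(slots, audio_embeds):
        out[slot] = emb
    return out


if __name__ == "__main__":
    print(audio_token_count(WHISPER_FRAMES, CONV_LAYERS))  # 1500 -> 750 -> 188
```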
It didn't. And the way it didn't is the interesting part.
The chicken-and-egg wall
Training loss flatlined around 8. The auxiliary autoregressive loss sat at 4.00. The model wasn't learning.
The failure was circular. The projector starts random, so its output is noise. The attention layers learn to ignore that noise, so almost no gradient reaches the projector. The projector therefore never improves, and its output stays noise. Both training tasks only use the audio when the model's attention actually looks at it, and while attention is ignoring the projector's random output, nothing in training ever tells the projector how to improve.
The unlock: supervise the projector directly
The fix was to give the projector a learning signal that skips attention altogether. We take the projector's 188 audio tokens, run them straight through DiffusionGemma's frozen lm_head, and apply a CTC loss against the transcript.
This sidesteps the whole standoff. CTC forces the audio embeddings to be linearly predictive of the right words in the model's own token space, no attention required. The projector now has a gradient that doesn't depend on anyone trusting it yet. Once its outputs actually mean something, attention has a reason to look, and the diffusion and AR objectives finally catch.
The plateau broke on the first try. CTC loss dropped 24 → 8.6 in 300 steps. Held-out token accuracy climbed off the floor.
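For readers who have not used CTC, here is a minimal pure-Python sketch of the loss as applied to per-token log-probabilities, the kind you would get by running the projector's audio tokens through an output head and taking a log-softmax. It is an illustration of the standard CTC forward algorithm, not our training code. The choice of ID 0 as the blank symbol is an assumption, since the post does not say which token plays that role, and a real setup would use a framework's batched CTC implementation.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_softmax(logits):
    """Turn one frame's raw logits into log-probabilities."""
    z = logsumexp(*logits)
    return [x - z for x in logits]


def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` summed over all valid CTC alignments.

    log_probs: one list of per-token log-probabilities per audio token (T x V).
    target:    non-empty list of label IDs, none equal to `blank`.
    """
    # Interleave blanks around the labels: [blank, y1, blank, y2, ..., blank].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # alpha[s] = log-prob of all partial alignments ending in ext[s] at the current frame.
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new = [NEG_INF] * size
        for s in range(size):
            candidates = [alpha[s]]                 # stay on the same symbol
            if s >= 1:
                candidates.append(alpha[s - 1])     # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])     # skip the blank between distinct labels
            new[s] = logsumexp(*candidates) + frame[ext[s]]
        alpha = new

    # Valid alignments end on the last label or the trailing blank.
    return -logsumexp(alpha[-1], alpha[-2])


if __name__ == "__main__":
    # Three frames, vocabulary {0: blank, 1, 2}, transcript [1, 2].
    peaked = [log_softmax(f) for f in ([0, 10, 0], [10, 0, 0], [0, 0, 10])]
    uniform = [log_softmax([0, 0, 0]) for _ in range(3)]
    print("peaked :", ctc_loss(peaked, [1, 2]))
    print("uniform:", ctc_loss(uniform, [1, 2]))
```

The loss only needs per-frame distributions and a transcript, which is why it can train the projector without any help from attention. CTC does require at least as many frames as target labels, plus one extra frame for each pair of adjacent repeated labels.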
The metric that lied
Token accuracy hit 0.50, CTC loss kept falling, and everything looked like grounding. Then we ran a small manual eval sample and noticed repetition.
the the the...