Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models
TL;DR — Reasoning by editing, not regenerating. Reflective Masking<br>turns a Mask Diffusion Model into a multi-turn reviser: it erases uncertain tokens,<br>regenerates only what is needed, and remembers previous attempts.
Abstract
Recent diffusion language models, such as Google's<br>DiffusionGemma, show<br>that text generation need not be left-to-right: a model can refine a whole canvas using<br>bidirectional context. We ask a complementary question: can existing<br>Mask Diffusion Models (MDMs) be taught to reason by revising their own previous<br>outputs? We propose Reflective Masking (RM), a lightweight post-training method that<br>turns masking into a model-driven decision (keep reliable tokens, re-mask uncertain<br>ones, and reveal better replacements), making an MDM a multi-turn reviser rather than a<br>one-shot decoder. To support multi-turn correction we add History Reference, a<br>parameter-free memory that exposes the denoising trajectory to the model. Unlike a large<br>pretrained diffusion LM, RM needs no architectural changes and no online rollouts, and drops<br>into existing MDMs across Sudoku, text reasoning, and image editing, enabling<br>sparse, iterative self-revision.
1. Re-masking is the self-correction MDMs were missing. MDMs<br>can edit in place but never choose to, so they lock in early mistakes. RM<br>makes masking a model-driven decision (keep reliable tokens, re-mask uncertain ones, reveal<br>better replacements), so the model fixes its own errors instead of carrying them forward.
2. A lightweight post-training recipe: no new architecture.<br>RM is activated by a scalable offline data pipeline (no online rollouts) and drops into<br>existing MDMs unchanged, validated across text, Sudoku, and image editing.
3. History Reference: a memory of past attempts, for free.<br>A parameter-free mechanism that carries the denoising trajectory forward, so the model<br>remembers what it already tried and stops repeating the same error.
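The keep / re-mask / reveal cycle, together with a memory of past attempts, can be illustrated with a toy loop. The sketch below is not the authors' implementation: `revise`, `toy_score`, `toy_fill`, the confidence threshold, and the oracle scorer are all hypothetical stand-ins. In the actual method the trained MDM itself decides which tokens to re-mask.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # sentinel for a re-masked position (assumption for this sketch)
Canvas = List[Optional[int]]


def revise(
    canvas: Canvas,
    score: Callable[[Canvas], List[float]],
    fill: Callable[[Canvas, List[Canvas]], Canvas],
    rounds: int = 8,
    threshold: float = 0.5,
) -> Tuple[Canvas, List[Canvas]]:
    """Run up to `rounds` of keep / re-mask / reveal on a full canvas."""
    history: List[Canvas] = []
    for _ in range(rounds):
        conf = score(canvas)  # per-token reliability in [0, 1]
        uncertain = {i for i, c in enumerate(conf) if c < threshold}
        if not uncertain:
            break  # every token is kept, nothing left to revise
        history.append(list(canvas))  # remember this attempt
        # Erase only the unreliable tokens; reliable ones stay in place.
        masked = [MASK if i in uncertain else tok for i, tok in enumerate(canvas)]
        canvas = fill(masked, history)  # reveal replacements
    return canvas, history


# Toy stand-ins for a trained model (hypothetical, oracle-based).
TARGET = [3, 1, 3]


def toy_score(canvas: Canvas) -> List[float]:
    """Oracle confidence: 1.0 where the token is right, 0.0 elsewhere."""
    return [1.0 if tok == want else 0.0 for tok, want in zip(canvas, TARGET)]


def toy_fill(masked: Canvas, history: List[Canvas]) -> Canvas:
    """Fill each masked slot with the smallest digit not yet tried there."""
    out: Canvas = []
    for i, tok in enumerate(masked):
        if tok is not MASK:
            out.append(tok)
            continue
        tried = {past[i] for past in history}
        out.append(min(d for d in (1, 2, 3) if d not in tried))
    return out


final, history = revise([3, 2, 1], toy_score, toy_fill)
```

In this run the first round repairs position 1 but guesses wrong at position 2. The second round re-masks only that position, and the stored history rules out the two digits already tried there.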
CoT thinks by continuing. RM thinks by revising.
A diffusion-native analogue of chain-of-thought reflection.
Side-by-side: AR Reasoning vs. Reflective Masking Reasoning
AR reasoning / reflection | Reflective Masking in MDMs
Generates thoughts left-to-right | Revises a full canvas bidirectionally
Corrects mistakes by appending more text or regenerating | Corrects mistakes by re-masking only unreliable tokens
Past mistakes remain in context | Wrong tokens can be erased from the current state
Test-time scaling = longer traces / more samples | Test-time scaling = more rounds of selective revision
Memory is textual context | Memory is History Reference over denoising states
Results
Reasoning through explicit revision
Three task families, from instruction-rich image editing to open-ended text reasoning.<br>Reflective Masking consistently beats masking-based baselines, and History Reference<br>helps most where the model must explore on its own (all trained in about<br>5 hours on 2×H100).
Sudoku — structured error correction
A tiny from-scratch MDM (0.81M params) recovers 9×9 boards with 4–20 corrupted cells<br>by iterative re-masking. History Reference (HR) sharply cuts repeated mistakes and rule<br>conflicts; adding History Embedding Rotation (HER) tops every metric.
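One way such corrupted inputs could be produced offline is sketched below. This is an assumption for illustration, not the paper's data pipeline: the cyclic `solved_board` pattern, the `corrupt` helper, and the uniform sampling of 4 to 20 wrong digits are all hypothetical.

```python
import random
from typing import List, Set, Tuple

Board = List[List[int]]  # 9x9 grid of digits 1-9


def solved_board() -> Board:
    """Return one valid solved board built from a standard cyclic pattern."""
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]


def corrupt(
    board: Board, rng: random.Random, lo: int = 4, hi: int = 20
) -> Tuple[Board, Set[Tuple[int, int]]]:
    """Overwrite between `lo` and `hi` cells with a wrong digit.

    Returns the noisy board and the set of corrupted (row, col) positions,
    which could serve as the re-masking target for that example.
    """
    k = rng.randint(lo, hi)  # inclusive on both ends
    cells = rng.sample([(r, c) for r in range(9) for c in range(9)], k)
    noisy = [row[:] for row in board]
    for r, c in cells:
        wrong = [d for d in range(1, 10) if d != board[r][c]]
        noisy[r][c] = rng.choice(wrong)  # always differs from the clean digit
    return noisy, set(cells)


rng = random.Random(0)  # fixed seed so the example is reproducible
clean = solved_board()
noisy, errors = corrupt(clean, rng)
changed = {(r, c) for r in range(9) for c in range(9) if noisy[r][c] != clean[r][c]}
```

Because each replacement digit is drawn from the eight values that differ from the clean one, the set of changed cells always equals the set of sampled positions.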
[Interactive figure: Sudoku revision trajectories (two examples). Initial frame: step 0 of 8, corrupted board, 19 errors remaining, 0 cells re-masked. Legend: wrong digit, re-masked, just corrected.]
Reflective masking on Sudoku. Two real revision trajectories: the model re-masks cells it<br>is unsure about (amber) and re-predicts them, turning wrong digits (red) into the correct<br>solution until the board is valid, driving errors down to 0.
Variant | Exact Accuracy (%) ↑ | Valid Rate (%) ↑ | Replay Mistake (%) ↓ | Conflict Cells (/board) ↓
RM (no History Reference) | 82.4 | 86.6 | 0.57 | 0.578
RM + HR | 91.4 (↑9.0) | 91.8 (↑5.2) | 0.07 (↓0.50) | 0.300 (↓0.278)
RM + HR + decay | 89.4 (↑7.0) | 89.6 (↑3.0) | 0.07 (↓0.50) | 0.362 (↓0.216)
Ours: RM + HR + decay + HER | 93.4 (↑11.0) | 93.6 (↑7.0) | 0.03 (↓0.54) | 0.236 (↓0.342)
Quantitative results on Sudoku revision. The arrow next to each value gives the change (Δ) versus the<br>RM (no History Reference) baseline; the best value in every column is in the final row (Ours).
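The page does not define the Valid Rate and Conflict Cells columns precisely. A plausible reading, assumed here, is that a cell is in conflict when its digit repeats in its row, column, or 3×3 box, and that a board is valid when no cell is in conflict. The helpers `conflict_cells` and `is_valid` below are hypothetical names for that reading.

```python
from typing import List

Board = List[List[int]]  # 9x9 grid of digits 1-9


def conflict_cells(board: Board) -> int:
    """Count cells whose digit repeats in the same row, column, or 3x3 box."""
    count = 0
    for r in range(9):
        for c in range(9):
            d = board[r][c]
            in_row = sum(board[r][j] == d for j in range(9))
            in_col = sum(board[i][c] == d for i in range(9))
            br, bc = 3 * (r // 3), 3 * (c // 3)  # top-left corner of the box
            in_box = sum(
                board[i][j] == d
                for i in range(br, br + 3)
                for j in range(bc, bc + 3)
            )
            # Each count includes the cell itself, so > 1 means a duplicate.
            if in_row > 1 or in_col > 1 or in_box > 1:
                count += 1
    return count


def is_valid(board: Board) -> bool:
    """A fully filled board is valid when no cell is in conflict."""
    return conflict_cells(board) == 0


# A valid board from a standard cyclic pattern, then one corrupted cell.
clean = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
noisy = [row[:] for row in clean]
noisy[0][0] = clean[0][1]  # duplicate the neighbouring digit

n_conflicts = conflict_cells(noisy)
```

Under this definition a single wrong digit can put several cells in conflict: here the corrupted cell clashes with one cell in its row and box and another in its column, so three cells are counted.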
Relation to DiffusionGemma (Google). DiffusionGemma independently validates<br>reasoning-by-revision on Sudoku: per its model card, exact-solve rises from<br>18% one-shot → 89.5% purely by revising over steps, and from<br>1.5% → 89.5% after fine-tuning a large pretrained model for<br>4,000 steps. Reflective Masking reaches an even higher 93.4% exact<br>accuracy with a 0.81M-parameter MDM trained from scratch — orders of magnitude<br>smaller than DiffusionGemma's fine-tuned backbone — and extends the same revision<br>mechanism beyond text to image editing, a modality DiffusionGemma<br>does not support.
DiffusionGemma: Google, “DiffusionGemma: 4× faster text...