Efficient and Training-Free Single-Image Diffusion Models | CVPR 2026 Highlight
efficient-SID
Haojun Qiu, Kiriakos N. Kutulakos, David B. Lindell
University of Toronto · Vector Institute
[Interactive comparison demo: SinDDM at 258×193 px (~8h training, 1.6s inference) vs. Ours (Training-Free) at 1600×608 px (0s training, megapixel+ output).]
Zero training. Less than 1s per sample. Diverse generation.
A closed-form denoiser running in patch space eliminates the neural network training required by prior work, generating megapixel images in under a second and gigapixel images in minutes.
Abstract
We consider the problem of generating images whose internal structure—defined by the distribution of patches across multiple scales—matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.
Why Do Single-Image Generative Models (Still) Matter?
Single-image generative modeling is useful for several reasons. In particular, learning a prior over image patches imposes strong constraints on the content of a generated image. Consider combining the style of one image with the structure of another: large models like GPT Image 2 or Nano Banana fill in content drawn from internet-scale training data, so the resulting patch statistics deviate from those of the input images. In contrast, single-image models constrain the output to contain exactly the patches present in the inputs, keeping style and structure closely aligned.
This kind of constraint is valuable in applications where data is scarce. Moreover, single-image modeling provides clear provenance over the content of the generated image.
Method: Closed-Form Denoiser
The optimal denoiser in diffusion models has a known closed-form solution: a weighted average over all clean signals in the dataset, where each weight is the posterior probability that the corresponding clean signal produced the noisy input. For large datasets like ImageNet (millions of high-dimensional images), this sum is intractable, and as the noise level decreases it collapses to memorization of the training set.
But for patches of a single image, both problems disappear. The dataset is finite and low-dimensional — the sum is tractable. And since we denoise patches, not whole images, stitching overlapping patch predictions together produces a rich distribution over global images, not a copy of the input.
No neural network. No training. No optimization.
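The weighted average described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes a variance-exploding noise model (noisy = clean + sigma * noise), and the function name `closed_form_denoise` and its arguments are hypothetical.

```python
import numpy as np

def closed_form_denoise(noisy, clean, sigma):
    """Posterior-mean denoiser over a finite patch dataset.

    Assumes noisy = clean + sigma * eps with eps ~ N(0, I) (an assumption,
    not taken from the paper).
    noisy: (M, D) noisy patches, clean: (N, D) dataset patches.
    Returns an (M, D) array of denoised patches.
    """
    # Squared distance between every noisy patch and every clean patch: (M, N).
    d2 = ((noisy[:, None, :] - clean[None, :, :]) ** 2).sum(axis=-1)
    # Log-likelihood of each clean patch having produced each noisy patch.
    logits = -d2 / (2.0 * sigma ** 2)
    # Numerically stable softmax over the dataset axis.
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    # Weighted average of clean patches.
    return w @ clean
```

At low noise the weights concentrate on the nearest dataset patch, and at high noise they flatten toward a uniform average, which matches the behaviour described in the text.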
Check out a related tutorial at CVPR 2026: Analytic Understanding of Diffusion Models.
Generation Pipeline
We extract overlapping patches from the reference image as our dataset. Starting from noise, at each timestep we extract noisy patches, apply the closed-form denoiser, and stitch predictions back into a full image. We generate at a coarse scale first to establish global structure, then progressively add finer scales — preserving layout while adding detail.
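The extract, denoise, and stitch loop can be sketched as follows for a single scale and a grayscale image. This is a minimal sketch under stated assumptions, not the paper's pipeline: the stride-1 patches, the uniform averaging of overlaps, the variance-exploding noise model, and the Euler-style update are all illustrative choices, every function name is hypothetical, and the coarse-to-fine schedule is omitted.

```python
import numpy as np

def extract_patches(img, p):
    """All overlapping p x p patches (stride 1) of an (H, W) image, as (K, p*p)."""
    win = np.lib.stride_tricks.sliding_window_view(img, (p, p))
    return win.reshape(-1, p * p)

def stitch_patches(patches, shape, p):
    """Average overlapping patch predictions back into an (H, W) image."""
    H, W = shape
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    k = 0
    for i in range(H - p + 1):          # same row-major order as extract_patches
        for j in range(W - p + 1):
            acc[i:i + p, j:j + p] += patches[k].reshape(p, p)
            cnt[i:i + p, j:j + p] += 1
            k += 1
    return acc / cnt

def denoise_image(noisy_img, ref_patches, sigma, p):
    """Denoise every patch with the closed-form denoiser, then stitch."""
    noisy = extract_patches(noisy_img, p)
    d2 = ((noisy[:, None, :] - ref_patches[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return stitch_patches(w @ ref_patches, noisy_img.shape, p)

def generate(ref_img, p, sigmas, seed=0):
    """Single-scale sampling from pure noise with a decreasing sigma schedule."""
    rng = np.random.default_rng(seed)
    ref_patches = extract_patches(ref_img, p)
    x = sigmas[0] * rng.standard_normal(ref_img.shape)
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        x0 = denoise_image(x, ref_patches, s, p)
        # Deterministic step: move toward the prediction, keep a fraction of the noise.
        x = x0 + (s_next / s) * (x - x0)
    return denoise_image(x, ref_patches, sigmas[-1], p)
```

A call such as `generate(ref, p=3, sigmas=np.geomspace(2.0, 0.01, 8))` returns an image with the same shape as `ref`. The all-pairs distance matrix in `denoise_image` is the quadratic cost that the acceleration techniques target.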
Acceleration: Making It Scale
The closed-form denoiser gives us the training-free part. But for the efficient part — scaling to megapixel and beyond — the naïve O(N²) cost per denoising step is a bottleneck. Interestingly, all three of our acceleration techniques are inspired by ideas that drove the scaling of large generative models:
FlashAttention
We show (Supp. Sec. 3.2) that our closed-form denoiser can be cast exactly as scaled dot-product attention — no specialized kernel needed. We simply call FlashAttention off-the-shelf for a ~2× speedup, fully exact.
↳ same kernel that powers large transformers
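One way to see the attention equivalence is to expand the squared distance: -||q - k||² / (2σ²) = (q·k - ||k||²/2) / σ² - ||q||² / (2σ²), where the last term is constant per query and cancels in the softmax. The sketch below checks this numerically in NumPy. The construction (appending one extra coordinate to queries and keys) is an assumption made here for illustration and may differ from the one in the supplement; `sdpa` is a plain reference attention, not a FlashAttention call, and all function names are hypothetical.

```python
import numpy as np

def sdpa(Q, K, V):
    """Reference scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return (w / w.sum(axis=1, keepdims=True)) @ V

def denoise_direct(noisy, clean, sigma):
    """Closed-form denoiser written with explicit squared distances."""
    d2 = ((noisy[:, None, :] - clean[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return (w / w.sum(axis=1, keepdims=True)) @ clean

def denoise_as_attention(noisy, clean, sigma):
    """Same denoiser expressed as attention with augmented queries and keys."""
    d = noisy.shape[1] + 1
    ones = np.ones((noisy.shape[0], 1))
    # Query [q, 1], pre-scaled so the 1/sqrt(d) inside sdpa leaves 1/sigma^2.
    Q = np.concatenate([noisy, ones], axis=1) * (np.sqrt(d) / sigma ** 2)
    # Key [k, -||k||^2 / 2] so that Q.K reproduces q.k - ||k||^2 / 2.
    half_sq = -0.5 * (clean ** 2).sum(axis=1, keepdims=True)
    K = np.concatenate([clean, half_sq], axis=1)
    # Values are the clean patches themselves.
    return sdpa(Q, K, clean)
```

Because the two functions agree up to floating-point error, a fused attention kernel could in principle replace `sdpa` without changing the result, which is consistent with the "fully exact" claim above.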
Latent Space (VAE)
A pretrained VAE acts as a pure...