Training-Free Single-Image Diffusion Models

Efficient and Training-Free Single-Image Diffusion Models | CVPR 2026 Highlight


efficient-SID

Haojun Qiu, Kiriakos N. Kutulakos, David B. Lindell

University of Toronto · Vector Institute

Paper

Code (coming soon)

Demo (coming soon)

Interactive comparison (demo widget):

SinDDM, 258×193 px: ~8h training, 1.6s inference (3 samples shown)

Ours (Training-Free), 1600×608 px: 0s training, megapixel+ output (20 samples shown)

Zero training. Less than 1s per sample. Diverse generation.

A closed-form denoiser operating in patch space removes the neural network training required by prior work, generating megapixel images in under a second and gigapixel images in minutes.

Abstract

We consider the problem of generating images whose internal structure—defined by the distribution of patches across multiple scales—matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.

Why Single-Image Generative Models (Still) Matter

There are multiple reasons why single-image generative modeling is useful. In particular, learning a prior over image patches imposes strong constraints on the content of a generated image. Consider combining the style of one image with the structure of another: large models like GPT Image 2 or Nano Banana fill in content drawn from internet-scale training data, so the resulting patch statistics deviate from those of the input images. In contrast, single-image models constrain the output to contain exactly the patches present in the inputs, keeping style and structure closely aligned.

This kind of constraint is valuable in applications where data is scarce. Moreover, single-image modeling provides clear provenance over the content of the generated image.


Method: Closed-Form Denoiser

The optimal denoiser in diffusion models has a known closed-form solution: a weighted average over all clean signals, where each weight reflects how likely that signal is to have produced the noisy input. For large datasets like ImageNet (millions of high-dimensional images), this sum is intractable, and as the noise level decreases the denoiser collapses to memorizing individual training samples.
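For reference, the weighted average can be written out explicitly. The block below assumes a variance-exploding noise model, $x_t = x_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and a uniform empirical distribution over the clean samples $\{x_i\}_{i=1}^{N}$. This page does not state the paper's notation or noise schedule, so the symbols here are illustrative.

```latex
\hat{x}_0(x_t, \sigma_t)
  = \mathbb{E}[x_0 \mid x_t]
  = \sum_{i=1}^{N} w_i(x_t)\, x_i,
\qquad
w_i(x_t)
  = \frac{\exp\!\left(-\lVert x_t - x_i \rVert^2 / 2\sigma_t^2\right)}
         {\sum_{j=1}^{N} \exp\!\left(-\lVert x_t - x_j \rVert^2 / 2\sigma_t^2\right)}
```

The weights form a softmax over negative squared distances. As $\sigma_t \to 0$ they concentrate on the nearest clean sample, which is the memorization behavior noted above.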

But for patches of a single image, both problems disappear. The dataset is finite and low-dimensional — the sum is tractable. And since we denoise patches, not whole images, stitching overlapping patch predictions together produces a rich distribution over global images, not a copy of the input.
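A minimal NumPy sketch of such a patch-level denoiser is shown below. It assumes noisy patches of the form clean + sigma * noise and a uniform prior over the reference patches. The function name `closed_form_denoise` and the toy data are hypothetical, chosen for illustration, and are not taken from the authors' code.

```python
import numpy as np

def closed_form_denoise(noisy, clean, sigma):
    """Posterior-mean denoiser over a finite patch dataset (illustrative).

    noisy: (M, D) noisy patches; clean: (N, D) reference patches;
    sigma: noise standard deviation (assumes noisy = clean + sigma * eps).
    """
    # Squared distance between every noisy patch and every clean patch: (M, N)
    d2 = (
        np.sum(noisy**2, axis=1, keepdims=True)
        - 2.0 * noisy @ clean.T
        + np.sum(clean**2, axis=1)[None, :]
    )
    logits = -d2 / (2.0 * sigma**2)
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ clean  # (M, D): weighted averages of clean patches

rng = np.random.default_rng(0)
clean = rng.normal(size=(50, 27))  # toy stand-in for fifty flattened 3x3 RGB patches
noisy = clean[:5] + 0.01 * rng.normal(size=(5, 27))

denoised_low = closed_form_denoise(noisy, clean, sigma=0.01)    # snaps to nearest patch
denoised_high = closed_form_denoise(noisy, clean, sigma=100.0)  # close to the dataset mean
```

At low noise the output snaps to the nearest reference patch, and at high noise it approaches the mean of all patches. Diversity in the final image comes from stitching many such overlapping patch predictions, as described above.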

No neural network. No training. No optimization.

Check out a related tutorial at CVPR 2026: Analytic Understanding of Diffusion Models.

Generation Pipeline

We extract overlapping patches from the reference image as our dataset. Starting from noise, at each timestep we extract noisy patches, apply the closed-form denoiser, and stitch predictions back into a full image. We generate at a coarse scale first to establish global structure, then progressively add finer scales — preserving layout while adding detail.
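The extract, denoise, and stitch loop can be sketched at a single scale as follows. This is an illustration under stated assumptions, not the authors' implementation: the stride-1 patches, the averaging used for stitching, the geometric noise schedule, and the deterministic update rule are all choices made here, and the coarse-to-fine multi-scale stage is omitted. All function names are hypothetical.

```python
import numpy as np

def extract_patches(img, p):
    """All overlapping p x p patches (stride 1) of an (H, W, C) image, flattened."""
    H, W, C = img.shape
    windows = np.lib.stride_tricks.sliding_window_view(img, (p, p), axis=(0, 1))
    # windows has shape (H-p+1, W-p+1, C, p, p)
    return windows.reshape(-1, C * p * p)

def stitch_patches(patches, shape, p):
    """Average overlapping patch predictions back into an (H, W, C) image."""
    H, W, C = shape
    out = np.zeros(shape)
    count = np.zeros((H, W, 1))
    patches = patches.reshape(H - p + 1, W - p + 1, C, p, p)
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            out[i:i + p, j:j + p] += patches[i, j].transpose(1, 2, 0)
            count[i:i + p, j:j + p] += 1
    return out / count

def denoise_image(noisy_img, ref_patches, sigma, p):
    """Denoise every patch with the closed-form weighted average, then stitch."""
    noisy = extract_patches(noisy_img, p)
    d2 = ((noisy[:, None, :] - ref_patches[None, :, :]) ** 2).sum(-1)
    logits = -d2 / (2.0 * sigma**2)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return stitch_patches(w @ ref_patches, noisy_img.shape, p)

def sample(ref, p=3, steps=10, sigma_max=1.0, sigma_min=0.02, seed=0):
    """Single-scale sampler: start from noise and repeatedly denoise and re-noise."""
    rng = np.random.default_rng(seed)
    ref_patches = extract_patches(ref, p)
    # Assumed schedule: geometric decay, followed by a final step to sigma = 0
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, steps), 0.0)
    x = sigmas[0] * rng.normal(size=ref.shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoise_image(x, ref_patches, sigma, p)
        # Deterministic step: keep a shrinking fraction of the current residual
        x = x0_hat + (sigma_next / sigma) * (x - x0_hat)
    return x

ref = np.random.default_rng(1).uniform(size=(12, 12, 3))  # toy reference image
generated = sample(ref)
```

Because every patch prediction is a convex combination of reference patches and stitching is an average, the final output stays within the value range of the reference image. A multi-scale version would run this loop on a downsampled reference first and use the upsampled result to initialize the next finer scale.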


Acceleration: Making It Scale

The closed-form denoiser gives us the training-free part. But for the efficient part — scaling to megapixel and beyond — the naïve O(N²) cost per denoising step is a bottleneck. Interestingly, all three of our acceleration techniques are inspired by ideas that drove the scaling of large generative models:

FlashAttention

We show (Supp. Sec. 3.2) that our closed-form denoiser can be cast exactly as scaled dot-product attention — no specialized kernel needed. We simply call FlashAttention off-the-shelf for a ~2× speedup, fully exact.
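One way to see the connection is to expand the squared distance inside the softmax. The sketch below is an assumption about how such a reduction can be done and may differ from the derivation in the supplement. It appends one extra dimension to the queries and keys so that the Gaussian weights become a plain softmax(scale * Q Kᵀ) V, then checks the result against the direct computation. Function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def denoise_direct(noisy, clean, sigma):
    """Reference implementation: softmax over negative squared distances."""
    d2 = ((noisy[:, None, :] - clean[None, :, :]) ** 2).sum(-1)
    return softmax(-d2 / (2.0 * sigma**2)) @ clean

def denoise_as_attention(noisy, clean, sigma):
    """Same denoiser written as scaled dot-product attention.

    -|q - k|^2 / (2 s^2) = q.k / s^2 - |k|^2 / (2 s^2) - |q|^2 / (2 s^2)
    The last term is constant across keys, so it cancels in the softmax.
    The |k|^2 term is folded into the dot product via one extra dimension.
    """
    q = np.concatenate([noisy, np.ones((len(noisy), 1))], axis=1)
    k = np.concatenate(
        [clean, -0.5 * np.sum(clean**2, axis=1, keepdims=True)], axis=1
    )
    scale = 1.0 / sigma**2
    return softmax(scale * (q @ k.T)) @ clean  # queries, keys, values = q, k, clean

rng = np.random.default_rng(0)
clean = rng.normal(size=(64, 27))
noisy = clean[:8] + 0.5 * rng.normal(size=(8, 27))

out_direct = denoise_direct(noisy, clean, sigma=0.5)
out_attention = denoise_as_attention(noisy, clean, sigma=0.5)
```

Once the computation has this form, the softmax(scale * Q Kᵀ) V expression can be handed to a fused attention kernel, which avoids materializing the full weight matrix.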

↳ same kernel that powers large transformers

Latent Space (VAE)

A pretrained VAE acts as a pure...
