PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

Spatial Intelligence Lab

Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, Xuanchi Ren

NVIDIA

arXiv 2026

TL;DR: PiD directly decodes latent representations into high-resolution images, replacing the decode-then-super-resolve cascade while achieving lower latency and higher visual quality.

[Figure: decoding comparisons for real image latents and generated image latents. Panels: SD3 VAE (VAE Decoder vs. PiD), DINOv2 (RAE Decoder vs. PiD), Z-Image (VAE Decoder vs. PiD), Flux.2 [dev] (VAE Decoder vs. PiD).]

Abstract

Most practical high-resolution text-to-image systems rely on latent diffusion models, where generation is performed in a compact latent space and a decoder maps latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes 4× and even 8× upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of 512×512 images into 2048×2048 pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about 6× faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.
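To make the idea of a sigma-aware adapter concrete, here is a minimal toy sketch in plain Python. It is not the PiD implementation: the linear noise blend, the sigmoid gate, and the names `corrupt_latent`, `sigma_gate`, and `inject` are all assumptions chosen for illustration. It only shows how a conditioning signal could be down-weighted as the latent gets noisier.

```python
import math
import random

def corrupt_latent(latent, sigma, rng):
    """Assumed corruption: blend a clean latent with Gaussian noise,
    (1 - sigma) * z + sigma * eps, with sigma in [0, 1]."""
    return [(1.0 - sigma) * z + sigma * rng.gauss(0.0, 1.0) for z in latent]

def sigma_gate(sigma, scale=-4.0, bias=2.0):
    """Hypothetical scalar gate in (0, 1) that shrinks as sigma grows."""
    return 1.0 / (1.0 + math.exp(-(scale * sigma + bias)))

def inject(features, latent, sigma, weight=0.5):
    """Add a gated, scaled copy of the latent to the backbone features.
    `weight` stands in for a learned projection (an assumption)."""
    g = sigma_gate(sigma)
    return [h + g * weight * z for h, z in zip(features, latent)]

rng = random.Random(0)
clean = [0.3, -1.2, 0.8, 0.1]
features = [0.0, 0.0, 0.0, 0.0]
for sigma in (0.0, 0.5, 1.0):
    noisy = corrupt_latent(clean, sigma, rng)
    out = inject(features, noisy, sigma)
    print(f"sigma={sigma:.1f} gate={sigma_gate(sigma):.3f} "
          f"out={[round(v, 3) for v in out]}")
```

With this kind of gating, a fully denoised latent (sigma near 0) contributes strongly to the features, while a heavily corrupted one (sigma near 1) is mostly suppressed, which is one plausible way to stay robust when the latent diffusion process is stopped early.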

Results

From Latent to Pixels

[Interactive comparison: VAE/RAE decode vs. PiD decode at different early-termination points of the LDM, up to fully denoised latents.]

Latent spaces: FLUX.1[dev], SD3, Z-Image, FLUX.2[dev], RAE, Scale-RAE

4K Decode

Direct latent→4K decoding with PiD.

[Interactive side-by-side comparison against the VAE decoder, using fully denoised latents.]

Latent spaces: FLUX.1[dev], SD3, Z-Image, FLUX.2[dev]

Baseline Comparison

[Synchronized zoom comparison across six views: VAE Decode, InvSR-1, Real-ESRGAN, SeedVR2, TSD-SR, PiD (Ours).]

Quantitative Results (Decoding + Upsampling, 512² → 2048²)

End-to-End Decoding Latency (ms) ↓

PiD is up to 5.9× faster than SeedVR2 (211.2 ms vs 1237.5 ms)
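The speedup figure follows directly from the two latencies quoted above. A short check, where the helper name `speedup` is introduced only for illustration:

```python
def speedup(baseline_ms, ours_ms):
    """Ratio of baseline latency to our latency (higher means faster)."""
    return baseline_ms / ours_ms

seedvr2_ms = 1237.5  # SeedVR2 end-to-end latency quoted above
pid_ms = 211.2       # PiD end-to-end latency quoted above

ratio = speedup(seedvr2_ms, pid_ms)
saved_ms = seedvr2_ms - pid_ms
print(f"{ratio:.1f}x faster, saving {saved_ms:.1f} ms per image")

# The 512^2 -> 2048^2 setting is 4x per side, i.e. 16x the pixel count.
pixel_factor = (2048 * 2048) // (512 * 512)
print(f"pixel count grows {pixel_factor}x")
```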

Gemini-3-Flash Judge Rating (%) ↑

% of evaluations where judges prefer PiD over each baseline

Method

Overview of PiD. PiD unifies latent decoding and upsampling as a single latent-conditioned pixel diffusion model that predicts the target-resolution pixel-space velocity field. Noise-corrupted latent training and sigma-aware gating make the decoder robust to partially denoised latents, enabling early exit from the base LDM while preserving high-resolution output quality.
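As a rough illustration of few-step sampling with a velocity-predicting model, the toy sketch below integrates a velocity field from noise to an upsampled signal in 4 Euler steps. Everything here is an assumption for illustration: the linear path x_t = (1 - t) * target + t * noise, the Euler integrator, the 1-D "image", and the names `upsample_nearest`, `toy_velocity`, and `decode`. The closed-form `toy_velocity` stands in for a trained network and is not the PiD model or its distilled sampler.

```python
import random

def upsample_nearest(latent, factor):
    """Repeat each 1-D latent value `factor` times (toy stand-in for the
    high-resolution content a real decoder would synthesize)."""
    return [v for v in latent for _ in range(factor)]

def toy_velocity(x, t, target):
    """Exact velocity for the path x_t = (1 - t) * target + t * noise,
    which is noise - target = (x_t - target) / t for t > 0."""
    return [(xi - ti) / t for xi, ti in zip(x, target)]

def decode(latent, factor=4, steps=4, seed=0):
    """Integrate from pure noise (t = 1) to the output (t = 0) with Euler steps."""
    rng = random.Random(seed)
    target = upsample_nearest(latent, factor)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    times = [1.0 - i / steps for i in range(steps + 1)]  # 1.0, 0.75, ..., 0.0
    for t, t_next in zip(times[:-1], times[1:]):
        v = toy_velocity(x, t, target)
        x = [xi + (t_next - t) * vi for xi, vi in zip(x, v)]
    return x

latent = [0.5, -1.0]
pixels = decode(latent, factor=4, steps=4)
print(len(latent), "latent values ->", len(pixels), "output values")
```

Because the toy velocity is exact, the sampler lands on the target regardless of the step count; with a learned network, the number of steps trades latency against quality, which is why reducing inference to a few steps matters.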

Citation

@article{pid2026,
  title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
  author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
  journal={arXiv preprint},
  year={2026}
}

Acknowledgement

The authors would like to acknowledge Yongsheng Yu and Wei Xiong for open-sourcing PixelDiT's model and weights, and thank Product Managers Aditya Mahajan and Matt Cragun for their valuable support and guidance.
