Déjà View: Looping Transformers for Multi-View 3D Reconstruction


TL;DR

Déjà View reconstructs camera poses and dense geometry from any number of views by applying the same transformer block in a looped fashion. A single checkpoint covers a range of step counts, so users can dial compute up or down at inference time. At only 117M parameters, Déjà View beats much larger feed-forward baselines while using 8–10× fewer parameters and 1.9–2.3× less compute (relative to π³ and Depth Anything 3 G) on five reconstruction benchmarks spanning indoor and outdoor scenes.


Abstract

Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and instead make that iteration explicit in the architecture.

Our model, Déjà View, applies a single looped transformer block recurrently to per-view features for $K$ refinement steps. Trained once, it exposes $K$ as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.
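To make the "unique parameters" argument concrete, the following is a rough sketch of how the weight count differs between a shared (looped) block and a variant with independent per-step parameters. The width `D_MODEL` and the `12 * d^2` approximation for a standard transformer block are illustrative assumptions, not the configuration used in the paper; both variants apply the block the same number of times, so only the number of unique weights changes.

```python
def transformer_block_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one standard transformer block.

    Biases and normalization parameters are ignored (assumption).
    """
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up- and down-projection
    return attention + mlp


def unique_params(block_params: int, steps: int, shared: bool) -> int:
    """Unique weights needed to run `steps` block applications."""
    return block_params if shared else block_params * steps


D_MODEL = 768  # hypothetical width, NOT taken from the paper
STEPS = 16     # number of block applications in this illustration

block = transformer_block_params(D_MODEL)
looped = unique_params(block, STEPS, shared=True)
unrolled = unique_params(block, STEPS, shared=False)

print(f"looped:   {looped:,} unique weights")
print(f"unrolled: {unrolled:,} unique weights ({unrolled // looped}x more)")
```

In this toy accounting the unrolled variant carries `STEPS` times as many unique weights for the same number of block applications, which is the trade-off the abstract refers to.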

Déjà View. Given multiple input views (top-left), Déjà View reconstructs camera poses and consistent depth by repeatedly applying the same transformer block, with the number of refinement steps K exposed as an inference-time compute knob. Decoding the intermediate state of a single K=16 forward pass at iterations k up to 16 shows progressively sharper geometry and more accurate camera poses (right; frustums are colored by per-camera error). Across five benchmarks (bottom-left), Déjà View matches or surpasses much larger feed-forward baselines at a small fraction of their parameter count (dot area).

Interactive viewer

Slide the iteration control inside the viewer to watch the same forward pass refine itself. Frustums are colored by per-camera pose error after Sim(3) alignment to ground-truth poses.

Method

Déjà View initializes per-view features from a pretrained DINOv2 encoder and applies a single transformer block, with frame and global attention sub-blocks, recurrently $K$ times to refine the state, with each application conditioned on its continuous time interval $(t_k, t_{k+1})$. Because $K$ is sampled per batch from $[K_\text{min}, K_\text{max}]$ during training, one trained checkpoint covers any step count in that range at inference. Two lightweight heads then decode the final state into per-view depth and ray maps.
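The control flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_block` is a hypothetical stand-in for the shared transformer block (it has no attention and no DINOv2 features), the intervals are assumed to be uniformly spaced on [0, 1], and $K$ is assumed to be sampled uniformly.

```python
import random


def time_intervals(num_steps):
    """Split [0, 1] into `num_steps` intervals (t_k, t_{k+1}).

    Uniform spacing is an assumption made for this sketch.
    """
    return [(k / num_steps, (k + 1) / num_steps) for k in range(num_steps)]


def toy_block(state, t_start, t_end, target):
    """Hypothetical stand-in for the shared block.

    Moves the state toward `target` by the share of the remaining
    time that this interval covers, so the last step lands on it.
    """
    frac = (t_end - t_start) / (1.0 - t_start)
    return [s + frac * (g - s) for s, g in zip(state, target)]


def looped_refine(state, block, num_steps, **block_kwargs):
    """Apply the SAME block `num_steps` times; keep every intermediate state."""
    trajectory = []
    for t_start, t_end in time_intervals(num_steps):
        state = block(state, t_start, t_end, **block_kwargs)
        trajectory.append(state)
    return trajectory


def sample_num_steps(rng, k_min, k_max):
    """Draw K for one training batch (uniform sampling is an assumption)."""
    return rng.randint(k_min, k_max)


rng = random.Random(0)
k = sample_num_steps(rng, 4, 16)
states = looped_refine([0.0, 0.0], toy_block, k, target=[1.0, 2.0])
print(f"K={k}, final state={states[-1]}")
```

Because the block receives its interval rather than a fixed step index, the same function can be called with any `num_steps`, and the list of intermediate states corresponds to the per-iteration decodes shown in the viewer.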

Results

At 117M parameters, Déjà View achieves the highest average inlier ratio and average pose AUC@30° over the five benchmarks while having the smallest parameter count of all compared models. Bubble area is proportional to parameter count.

[Figure: two bubble charts with parameters on the x-axis (log scale) and bubble area ∝ parameters. One plots average inlier ratio (%), the other average pose AUC@30° (%). Models shown: Pi3 (959M), VGGT (1257M), DA3-L (356M), DA3-G (1201M), Déjà View (117M).]
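The parameter counts listed for the compared models are enough to check the size gap by hand. The short script below only divides the counts given in the charts; the dictionary and function names are introduced here for illustration.

```python
# Parameter counts in millions, as listed in the charts.
PARAMS_M = {
    "Pi3": 959,
    "VGGT": 1257,
    "DA3-L": 356,
    "DA3-G": 1201,
    "Déjà View": 117,
}


def param_ratio(baseline: str, ours: str = "Déjà View") -> float:
    """How many times more parameters `baseline` has than `ours`."""
    return PARAMS_M[baseline] / PARAMS_M[ours]


for name in PARAMS_M:
    if name != "Déjà View":
        print(f"{name}: {param_ratio(name):.1f}x the parameters of Déjà View")
```

Pi3 and DA3-G come out at roughly 8.2× and 10.3×, consistent with the "8–10× fewer parameters" figure quoted in the TL;DR.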

Qualitative comparison

The left viewer shows Déjà View. Use the dropdown on the right to compare against recent feed-forward baselines on the same sequence, and pick a different example using the strip below. All sequences are visualized after discarding the bottom 25% of points by predicted confidence (if available).

Models available in the viewer: Déjà View (117M), Depth Anything 3 G (1201M), Depth Anything 3 L (356M), π³ (959M), VGGT (1257M), VGGT-Ω (1144M).
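The confidence filter used for visualization can be sketched as follows. This is a minimal illustration, assuming points and confidences are stored as parallel Python lists; the function name and the tie-breaking behavior are choices made here, not details taken from the viewer's code.

```python
def keep_confident(points, confidences, drop_fraction=0.25):
    """Drop the lowest-confidence `drop_fraction` of points.

    If no confidences are available (None), all points are kept.
    """
    if confidences is None:
        return list(points)
    if len(points) != len(confidences):
        raise ValueError("points and confidences must have the same length")

    num_drop = int(len(points) * drop_fraction)
    # Indices ordered from least to most confident; ties keep input order.
    order = sorted(range(len(points)), key=lambda i: confidences[i])
    dropped = set(order[:num_drop])
    return [p for i, p in enumerate(points) if i not in dropped]


points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
confidences = [0.9, 0.1, 0.5, 0.7]
print(keep_confident(points, confidences))
```

With four points and the default 25% cut, exactly one point (the one with confidence 0.1) is removed and the original ordering of the rest is preserved.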


Citation

@misc{burzio2026dejaview,
  title         = {D\'ej\`a View: Looping Transformers for Multi-View 3D Reconstruction},
  author        = {Burzio, Alessandro and Fischer, Tobias and Elflein, Sven and Zhou, Qunjie and de Lutio, Riccardo and Ren, Jiawei and Huang, Jiahui and Huang, Shengyu and Pollefeys, Marc and Leal-Taix\'e, Laura and Gojcic, Zan and Turki, Haithem},
  year          = {2026},
  eprint        = {2605.30215},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.30215},
}

Acknowledgements

We thank our colleagues at NVIDIA for valuable discussions and feedback.

